









Psychological Statistics 



WILEY PUBLICATIONS IN PSYCHOLOGY 


FOUNDATIONS OF PSYCHOLOGY 

By Edwin G. Boring, Herbert S. Langfeld, and Harry P. Weld 
INTRODUCTION TO PSYCHOLOGY 

By Edwin G. Boring, Herbert S. Langfeld, and Harry P. Weld 
SOCIAL PSYCHOLOGY 

By Daniel Katz and Richard L. Schanck 
HEARING—ITS PSYCHOLOGY AND PHYSIOLOGY 
By S. Smith Stevens and Hallowell Davis 
MANUAL OF PSYCHIATRY AND MENTAL HYGIENE. Seventh Edition 
By Aaron J. Rosanoff

STATISTICAL METHODS IN BIOLOGY, MEDICINE, AND PSYCHOLOGY. Fourth Edition
By C. B. Davenport and Merle P. Ekas 
MOTIVATION OF BEHAVIOR 
By P. T. Young 

PSYCHOLOGY—A FACTUAL TEXTBOOK 

By Edwin G. Boring, Herbert S. Langfeld, and Harry P. Weld 
PSYCHOLOGY IN BUSINESS AND INDUSTRY 
By John G. Jenkins 


HERBERT S. LANGFELD 
Advisory Editor 

APPLIED EXPERIMENTAL PSYCHOLOGY 

By Alphonse Chapanis, W. R. Garner, and C. T. Morgan 
THEORY OF HEARING 
By Ernest Glen Wever 
PSYCHOLOGICAL STATISTICS 
By Quinn McNemar 
METHODS OF PSYCHOLOGY 
T. G. Andrews, Editor 

THE PSYCHOLOGY OF EGO-INVOLVEMENTS 
By Muzafer Sherif and Hadley Cantril
MANUAL OF CHILD PSYCHOLOGY 
Leonard Carmichael, Editor 
EMOTION IN MAN AND ANIMAL 
By P. T. Young 
UNCONSCIOUSNESS 
By James Grier Miller 

THE PSYCHOLOGY OF PERSONAL ADJUSTMENT. Second Edition 
By Fred McKinney 

THE PSYCHOLOGY OF SOCIAL MOVEMENTS 
By Hadley Cantril



Psychological 

Statistics 


by QUINN McNEMAR 

PROFESSOR OF PSYCHOLOGY, STATISTICS, 
AND EDUCATION 
STANFORD UNIVERSITY 


1949 

JOHN WILEY & SONS, INC., NEW YORK 
CHAPMAN & HALL, LTD., LONDON 



Copyright, 1949,

BY 

John Wiley & Sons, Inc. 


All Rights Reserved 

This book or any part thereof must not
be reproduced in any form without
the written permission of the publisher.


THIRD PRINTING, NOVEMBER, 1949 


PRINTED IN THE UNITED STATES OF AMERICA 



Preface 


During fifteen years of teaching introductory and advanced 
statistics I gradually developed several chapters of systematic 
syllabus as a supplement to class lectures and in lieu of a text. 
This syllabus has been so well received by my students that I 
decided to revise and expand it into textbook form. The result 
may be of service to instructors who have felt the need for a text 
that would serve as a concise introduction to statistical methods 
and provide a continuous transition to more and more advanced 
topics. The material herein should be adequate for a year’s 
course. 

I shall not attempt to defend the inclusion of some topics and 
the exclusion of others. It is my belief that the more frequently 
useful techniques have been included. Nor do I defend the inclusion
of elementary mathematical derivations beyond saying that
I think it fruitless to try explaining statistics to those who are not 
prepared to do some thinking in mathematical language. 

Some difference of opinion exists as to the optimal order for 
presenting sampling and correlation. I prefer to introduce early 
the problem of sampling and statistical inference. Admittedly, 
a discussion of sampling without prior knowledge of correlation 
has its awkward moments when one is dealing with the sampling 
error of the difference between correlated values. Practically all 
such sampling errors can be handled by way of differences, with 
brief mention of correlation. Moreover, I find that the meaning of 
correlation can best be explained after the student has knowledge 
of sampling errors. 

Whatever merits this text possesses will not be enhanced by 
what is said in the preface. It is my hope that Chapters 5 and 16 
surpass the usual treatment of sampling, and that a useful purpose
is served by the bringing together in one chapter (8) of most
of the factors which need consideration in correlational analysis. 
I have made an effort to delineate clearly the meaning of the 
analysis of variance, but whether successfully is for the reader 
to decide. 





It is impossible to disentangle and acknowledge all the factors 
that have contributed to the content and the writing of this volume.
My contemporaries will perhaps recognize the influence of
two of my teachers, Professors Truman L. Kelley and Harold 
Hotelling. I am appreciative of the careful work of a former 
student, Margaret Soares, who undertook the tedious task of 
typing the final copy. The greatest personal indebtedness is to 
Olga W. McNemar, whose critical editing did much to clarify 
the exposition and to rid the manuscript of errors. 

I am indebted to Professor Ronald A. Fisher and Dr. Frank 
Yates, also to Messrs. Oliver & Boyd Limited, Edinburgh, for 
permission to reprint Tables Nos. III, IV, V, and VII from their
book “Statistical Tables for Biological, Agricultural and Medical
Research.”

Quinn McNemar 

Palo Alto 
July, 1948 



Contents 


CHAPTER

1. Introduction
2. Tabular and graphic methods
3. Describing frequency distributions
4. The normal curve and probability
5. Sampling errors and statistical inference
6. Correlation: Introduction and computation
7. Correlation: Interpretations and assumptions
8. Factors which affect the correlation coefficient
9. Multiple correlation
10. Other correlation methods
11. Frequency comparison: Chi square
12. Small sample methods
13. Analysis of variance: Simple
14. Analysis of variance: Complex
15. Analysis of variance: Covariance method
16. Notes on sampling and statistical inference

Appendix (Tables)

Index




CHAPTER 1


Introduction


Statistical methods are concerned with the reducing of either
large or small masses of data to a few convenient descriptive terms
and with the drawing of inferences therefrom. The data are collected
by any of several methods of research with the aid of measuring
devices appropriate to a given area of investigation. The
research methods are variously named and classified. Thus in
psychology we have methods which are labeled experimental,
clinical, observational, etc. The devices for measuring or securing
responses vary from those which involve delicate apparatus through
paper-and-pencil schemes to controlled observations and interviews.
Statistical techniques are not to be considered as coordinate
either with research methods or with devices for obtaining
and recording responses, but rather as tools for analyzing data
collected by whatever means.

The reduction of a batch of data to a few descriptive measures 
is the part of statistical analysis which should lead one to a better 
over-all comprehension of the data. All readers will be more or 
less familiar with the concept of average. An average is a measure 
which describes what is typical of a group with respect to some 
trait, characteristic, or variable. If one is comparing two or more
groups, the determination of an average for each group permits a
better appraisal of possible group differences than would be obtained
by casual examination of the data. There are various
statistical measures, or types of averages, which have proven 
useful as descriptive terms for a variety of data. One aim of this 
book is to present and discuss the descriptive statistical measures 
most frequently needed in psychological research. Proper usage 
and interpretation of these terms and evaluation of their use by 
others are not possible without knowledge of their meaning and
their limiting assumptions. Incidentally, the user of statistical
measures must give some thought to computational procedures. 

That part of statistical analysis which has to do with the drawing
of inferences is imposed upon us because of certain inadequacies
of research data. For instance, an investigator who wishes to 
know the average height of adult women in the United States will 
never have facilities for measuring every woman. Accordingly, 
he is compelled to measure a sample of women; then on the basis
of information yielded by the sample he can make an inference
concerning the average height of the population of women.
Another investigator, wishing to evaluate the relative merits of 
two learning methods, tries out the methods with two small groups 
of students, and from the results an inference is made concerning 
what might be expected if he had facilities for working with very 
large groups. An opinion poller may seek information about the 
reactions of Republicans and Democrats to some world event. 
By questioning a sample of each group he can secure sufficient 
data for drawing an inference regarding a possible difference between
the population of Republicans and the population of Democrats.

The problem of statistical inference is usually that of determining
whether statistical significance can be attached to results
after due allowance is made for known sources of error. There
are many and varied situations for which we need tests of significance,
and accordingly several tests are available. Intelligent
and critical inferences cannot be made by those who do not understand
the purposes, assumptions, and applicability of the various
techniques for judging significance.

It is in connection with the problem of drawing inferences that 
a knowledge of statistical methods is most helpful. A research 
should be planned in such a way that the resulting data are amenable
to treatment by the available statistical techniques. With
sufficient information concerning these techniques of analysis, one 
should be able to lay out in advance of data collecting the main 
types of statistical analysis to be used. If a proposed experimental
setup precludes the possibility of adequate analysis, it
may be found that a slight alteration in the plan will remedy the
situation. All too frequently the statistician is called in to help
with data which have not been collected in such a manner as to 
permit efficient analysis. Only by knowing the available methods
of analysis can one plan a research with assurance that the results
can be handled statistically. 

Another reason for keeping in mind statistical considerations 
while planning a research is the fact that some experimental designs
are preferable because they permit, with small additional
cost, or even at a saving, better control of error than other plans. 
Indeed, certain designs lead to a marked reduction in known 
sources of error. 

A third reason for planning with foresight regarding the statistical
analysis is that a set of data can sometimes be made to serve
for checking several different hypotheses. 

The student should be warned that he cannot expect miracles 
to be wrought by the use of statistical tools. Although statistical 
methods have an important place in present-day psychological 
research, it does not follow that they can be utilized to salvage 
data that result from a haphazardly planned and sloppily executed 
investigation. No amount of statistical juggling can transfigure 
bad data into acceptable form. It is doubtful whether the student 
who comes to the statistician with a batch of data and the question,
“Can I compute a correlation coefficient . . .” will make a
scientific contribution, but such a student deserves sympathy, 
especially if his major advisor has suggested that he need not 
worry about statistics until he has collected data. 

The purpose of the present book is to acquaint the student with 
the statistical techniques commonly used, to suggest economical 
computational procedures, and to state the assumptions and limitations
of the various techniques. Whenever the understanding
of a particular technique can be clarified by a simple derivation, 
such a derivation will be given. Unfortunately, many of the 
derivations are too complicated mathematically to permit consideration
in an elementary or intermediate treatment. The
qualified and interested student will find some of these derivations 
in more advanced textbooks and others in original sources. 

Statistical methods belong in the realm of applied mathematics, 
and consequently extensive scholarship in mathematics is required 
of those who choose to specialize in statistics. One can, however, 
secure a practical working knowledge of statistical techniques 
without first becoming a mathematician, providing his deficiency
in mathematics is not accompanied by an emotional reaction to
symbols.





Within the realm of psychological research there is wide variation
in the need for statistical procedures. One can find current
research reports which involve no use of statistics, some which
involve very simple statistical treatment, still others which lean
heavily on the tools of statistics, and a few which are highly statistical.
One need not shift from one area of investigation to another
to find this variation, but it is true that certain areas of research
in psychology have less dependency than others on statistical procedures.
The area of psychology which seems most dependent
upon statistics is psychological measurement. This dependency
is due mainly to the very nature of psychological measurement,
the theory of which is largely statistical.

The presence or absence of statistical analysis per se is not a 
safe criterion for judging the worth of a study—some studies 
would have been improved by the utilization of statistics, whereas 
others would be better if they had been so designed as to depend 
less upon statistical analysis. Except for the requirement that 
the statistical analysis be adequate, there are no general rules as 
to how statistical a research should be. Of two experimental 
plans, either of which would provide appropriate data for checking
a given hypothesis or sets of hypotheses, that plan which calls
for simple statistical analysis is certainly preferable to the one 
which requires elaborate analysis. Experimental control of errors 
is far better than statistical adjustments. 



CHAPTER 2


Tabular and Graphic Methods 


When we are faced with a mass of data, the first manipulative 
step is tabulation or classification. If we are dealing with the 
number of children per family, the tabulation is equivalent to 
counting the number of one-child families, two-child families, etc.; 
or if we have information on 1000 persons regarding their national 
origin, we can tabulate, or count, the number of those of German, 
French, Italian, etc., origin; or these same individuals can be 
classified as to eye color. If we have their heights, we can also 
classify (or tabulate) them as being 58, 59, 60, etc., inches in height,
and if the shortest person is 58 and the tallest is 78 inches, we 
would tabulate our 1000 into 21 different inch groups. If we also 
know the weights of these individuals, we can classify again, this 
time as 100, 101, up to (say) 229 pounds, and thereby have 130 
groups. In all these situations we can classify with respect to the 
given characteristics, but the resulting tabulations will show marked 
differences as we pass from trait to trait. For instance, we may 
have only six national groups, and it will make little difference 
whether Germans or Russians are first on the tabulation sheet. 
Such a characteristic as nationality or eye color is said to be unordered
(and somewhat discrete). The number of children per family
is discrete but can be ordered, from least to greatest number. 
Now such a trait as height can also be ordered, but it is said to be 
continuous (nondiscrete) because it is possible to have an infinite 
number of in-between values very closely spaced. Such a series 
is sometimes called graduated. It will of course be obvious that a 
discrete series does not permit of in-between values, e.g., no family 
can have 2½ children.

For most purposes it is adequate if we tabulate, or classify, 
individuals into certain large groups. For example, instead of 
classifying our 1000 persons into pound groups (130 such groups) 

5 



6 Tabular and Graphic Methods 

it is usually sufficient to classify them into broader groups, say 
100-109, 110-119, etc., thereby obtaining 13 large groups. As a 
matter of fact, the use of fewer groups has a distinct advantage 
in that the labor of tabulating and computing descriptive terms 
is greatly lessened. The factors influencing the choice of the grouping
interval are two: first, its size should be such as to permit at
least 10 or 12, but not more than 20, classes or groups; and second,
it should promote tabulating convenience. Suggestions for choosing
tabulating intervals are: (1) determine the range of measures
or scores, i.e., the difference between the lowest and highest; (2) by 
inspection determine whether the range can be divided into 12 to 
20 equal intervals of some convenient size, say 5 or 10; and (3) let 
the lower number of each interval be a multiple of the size of the 
interval. It is customary to arrange the tabulation sheet with 
the highest or largest values of the variable at the top and to use 
either dots or tally marks when tabulating. The tallies per interval
can be counted and recorded to the right of the tally marks.
This column is usually labelled f, and the sum of the f's will be N,
or the total number of individuals in all the grouping intervals.
Tabulation results in a frequency table or frequency distribution,
such as that shown in the first two columns of Table 1. 

Table 1. Frequency Distribution of IQ's for 161 Five-year-old Boys

Interval     f    Smoothed f    Cumulative f
160-169      1        .3            161
150-159      0       1.3            160
140-149      3       4.0            160
130-139      9      13.7            157
120-129     29      25.7            148
110-119     39      34.3            119
100-109     35      35.3             80
 90- 99     32      25.0             45
 80- 89      8      14.0             13
 70- 79      2       3.7              5
 60- 69      1       1.3              3
 50- 59      1       1.0              2
 40- 49      1        .7              1
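
The tabulating suggestions given before Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the ten weights are hypothetical (they are not data from the text), and whole-number scores are assumed.

```python
from collections import Counter

def frequency_table(scores, width):
    """Tally whole-number scores into intervals of size `width`.

    The lower number of each interval is a multiple of `width`, and the
    rows are returned with the highest interval first.
    """
    counts = Counter((s // width) * width for s in scores)
    lowest, highest = min(counts), max(counts)
    rows = []
    for lower in range(highest, lowest - width, -width):
        # Intervals with no tallies are kept, with a frequency of zero.
        rows.append((lower, lower + width - 1, counts.get(lower, 0)))
    return rows

# Hypothetical weights in pounds, for illustration only.
weights = [104, 118, 121, 125, 127, 133, 136, 139, 142, 157]
table = frequency_table(weights, 10)
for lower, upper, f in table:
    print(f"{lower}-{upper}  {f}")
print("N =", sum(f for _, _, f in table))
```

With only ten cases the rule of 10 to 20 classes cannot be honored; the sketch shows the mechanics of tallying, not a recommended choice of interval.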


It should be noted that the expressed interval limits in a frequency
table are not necessarily the actual limits. Thus, if weight
has been taken to the nearest pound, the actual limits of the interval
130-139 would be 129.5 and 139.5; but if the ages of individuals
have been taken as at the last birthday, the interval 20-24 would
have actual limits of 20 and 24.999+. Obviously for purposes of
tabulation we need not use the implied actual limits, and for 
computational purposes we usually need either the lower limit or 
the midpoint of certain intervals, so there is nothing to be gained 
by meticulously labeling the intervals with actual limits. 
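
The distinction between expressed and actual limits can be put as a small Python sketch. The function names and the two `rounding` labels are my own assumptions for illustration; the upper actual limit for the last-birthday case is written as 25, standing for 24.999+.

```python
def actual_limits(lower, upper, rounding="nearest"):
    """Return the actual limits of an interval given its expressed limits.

    rounding="nearest": measures were rounded to the nearest unit.
    rounding="last_birthday": measures were truncated, as with age at
    last birthday, so the upper limit is approached but never reached.
    """
    if rounding == "nearest":
        return lower - 0.5, upper + 0.5
    if rounding == "last_birthday":
        return float(lower), float(upper + 1)
    raise ValueError("unknown rounding rule: " + rounding)

def midpoint(lower, upper, rounding="nearest"):
    """Midpoint of the interval, taken between its actual limits."""
    lo, hi = actual_limits(lower, upper, rounding)
    return (lo + hi) / 2

print(actual_limits(130, 139))                 # weight to nearest pound
print(actual_limits(20, 24, "last_birthday"))  # age at last birthday
print(midpoint(130, 139), midpoint(20, 24, "last_birthday"))
```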



Fig. 1. Histogram for data of Table 1.


GRAPHIC PRESENTATION 

If one scrutinizes the tally marks or the frequency table, he 
can obtain some notion as to how the individual values are distributed.
A number of pictorial schemes have been suggested
as aids in the study of frequency distributions. It is possible to 
lay off the various values (or intervals) of the variable on the 
horizontal or x axis, and to let the vertical or y axis represent the 
frequency per value or interval. The frequencies of the several 
intervals can be represented by drawing a horizontal line across 
each interval at the height corresponding to the number of cases 
in that interval, and then connecting these horizontals with verticals
erected at the interval limits. This yields a histogram (Fig. 1).
Using the same arrangement of the vertical and horizontal scales, 
one can merely indicate the frequency with a dot or cross placed 
directly above the midpoint of the interval, and then connect the 
adjacent points with straight lines. This results in a frequency
polygon (Fig. 2). Such a polygon or the corresponding histogram
will usually show irregularities; on the assumption that these are 
due to the operation of chance, one can draw a smooth curve, 
cutting as near the points as possible, and this curve can be thought 
of as giving a better picture than the original polygon. A curve 
which is obtained by freehand drawing or by graphic smoothing 
schemes or by repeated smoothing of the frequencies by a method 
of moving averages is known as a frequency curve. One method of
moving averages is illustrated in Table 1, in which an average is
taken over three intervals. The smoothed value for an interval
is obtained by summing the frequencies in that interval and the
two adjacent intervals and dividing by three. Thus the smoothed
value for the interval 80-89 is equal to the sum of the frequencies
2, 8, and 32, divided by 3. For the 90-99 interval, 8, 32, and 35 are
summed and divided by 3. The student should plot both the
original and smoothed frequencies so as to compare the two graphs.

Fig. 2. Frequency polygon for data of Table 1.
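
The three-interval moving average just described may be checked with a short Python sketch. Two assumptions are made here: the frequency of the 150-159 interval is taken as zero (as the cumulative column of Table 1 implies), and the intervals lying beyond either end of the table are treated as having zero frequency. The function name `smooth` is my own.

```python
# Frequencies of Table 1, lowest interval (40-49) first.
# The 150-159 interval is assumed to be empty.
f = [1, 1, 1, 2, 8, 32, 35, 39, 29, 9, 3, 0, 1]

def smooth(freqs):
    """Average each frequency with its two neighbors.

    Intervals beyond either end are assumed to have zero frequency.
    """
    padded = [0] + list(freqs) + [0]
    return [(padded[k - 1] + padded[k] + padded[k + 1]) / 3
            for k in range(1, len(padded) - 1)]

smoothed = smooth(f)
for lower, value in zip(range(40, 170, 10), smoothed):
    print(f"{lower}-{lower + 9}  {value:.1f}")
```

Under these assumptions the rounded results agree with the smoothed column of Table 1, e.g., 14.0 for the interval 80-89 and 25.0 for 90-99.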

Another type of graph can be obtained by the use of cumulative 
frequencies. In Table 1 will be found a column headed “Cumulative f.”
These values are obtained by successive adding of the
frequencies, beginning with the lowest interval. Adding 1 and 1
gives 2, adding to this the next frequency gives 3, to which in
turn is added the next, giving 5, and so on until we have 160 plus 1
for the last cumulative value, which is the total number of cases.
Obviously, from the cumulative table one can tell how many
individuals fall below a given point. If one plots the cumulative
values and connects the plotted points, an ogive curve results 
(Fig. 3). Note that, in plotting the cumulative frequencies, one 
does not use the midpoint of the interval, but rather the upper 
boundary. Why? 
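
The successive adding can be sketched in Python as follows. The frequencies are those of Table 1 with the 150-159 interval assumed empty, and the upper boundaries assume that IQ is recorded to the nearest whole number; both are assumptions of this sketch rather than statements in the text.

```python
from itertools import accumulate

# Expressed lower limits 40, 50, ..., 160 and the frequencies of Table 1,
# lowest interval first (150-159 assumed empty).
lower_limits = list(range(40, 170, 10))
f = [1, 1, 1, 2, 8, 32, 35, 39, 29, 9, 3, 0, 1]

# Running total of the frequencies, beginning with the lowest interval.
cumulative = list(accumulate(f))

# Each cumulative value is paired with the upper boundary of its interval
# (49.5, 59.5, ...), assuming scores rounded to the nearest whole number.
upper_boundaries = [lo + 9.5 for lo in lower_limits]
ogive_points = list(zip(upper_boundaries, cumulative))

for boundary, total in ogive_points:
    print(boundary, total)
```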

Fig. 3. Ogive for data of Table 1.

The use of frequency polygons in the comparison of two groups
is quite simple and often very enlightening. All that is necessary
is to plot the data for both groups on the same sheet and with
reference to the same axes. If the number of cases in the two
groups differs markedly, a better comparison can be obtained by 
converting the frequencies for each group to percentages of the 
total number in each group. Polygons based on percentage frequencies
will not portray differences which are merely a reflection
of differing N's and therefore are more comparable. A glance at
two such frequency polygons will reveal whether the two groups
show marked differences in the trait in question or to what extent
the two distributions overlap. More refined methods for comparing
groups will be discussed later.
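
A brief Python sketch of the conversion to percentage frequencies follows. The two sets of frequencies are hypothetical, chosen only so that the groups differ greatly in N; the function name is likewise my own.

```python
def percentage_frequencies(freqs):
    """Express each frequency as a percentage of the group's total N."""
    n = sum(freqs)
    return [100 * f / n for f in freqs]

# Hypothetical frequencies over the same five intervals.
group_a = [2, 6, 12, 6, 2]        # N = 28
group_b = [10, 40, 100, 40, 10]   # N = 200

pct_a = percentage_frequencies(group_a)
pct_b = percentage_frequencies(group_b)
for a, b in zip(pct_a, pct_b):
    print(f"{a:5.1f}  {b:5.1f}")
```

Plotted as raw frequencies, the second polygon would tower over the first merely because of its larger N; plotted as percentages, the two can be compared for location and spread.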

When one wishes to picture a discrete series, it is customary to
use either horizontal or vertical bars, separated from each other,
to represent the several frequencies. As in the case of frequency
polygons and histograms, there are no hard and fast rules regarding
the heights (or lengths) of the bars relative to the horizontal
(or vertical) base. The student should attempt to avoid extreme
lack of proportion. Often in newspapers and magazines one finds
that frequencies have been represented as areas or solids. A 
circular diagram, or pie chart, in which the sizes of the separate 
sectors represent the percentage falling into given groups or 
classes is sometimes used to picture relative frequencies. There 
is some evidence, and a general consensus of opinion, that some 
type of linear graph is less likely to be misinterpreted than one 
depending upon areas or solids. 



Fig. 4. Learning curve (same data as in Fig. 5). Horizontal axis: practice period, 1 to 7.

Fig. 5. Learning curve (same data as in Fig. 4). Horizontal axis: practice period, 1 to 7.

Another type of graphical representation is used to picture the 
relationship between two variables, e.g., growth in stature and 
age, or price change with year. To make such a line graph, one 
can lay off time or age or trials, on the horizontal axis, choose a 
convenient scale on the y axis for the other variable, and then 
plot the observational values. The line graph should be arranged 
so that the graph is read from left to right and from the bottom 
to the top, and the scales on the two axes should allow the inclusion
of all observed values of the two variables and at the same
time permit of a well-balanced or well-proportioned picture. A 
line graph can be made misleading by the choice of the scales on 
the two axes. For instance, if one is plotting the practice curve 
for card sorting (number of cards sorted on y axis, trial number on
x axis), it is possible to make a tremendous difference in the appearance
of the graph simply by altering the scale on the y axis. Of
two curves which represent the same relationship, one (Fig. 4)
would give the impression that the learning had progressed quite
rapidly, whereas the other (Fig. 5) would lead one to think that 
progress was slow. The student will do well to develop a healthy 
scepticism of all graphs which he encounters for the simple reason
that either scale can be so selected as to lead to gross misinterpretation.

It should be noted that smoothing may be applied to line graphs 
as well as to frequency polygons. Often, if a line graph is smoothed, 
the relationship between the two variables can be more adequately 
characterized. Smoothing out the irregularities will help one to 
see whether the relationship is linear or logarithmic or parabolic 
or of some other common type. Frequently a verbal description 
of a curve will aid in understanding something of the functional 
relatedness of the two variables. To state a relationship in more
exact mathematical language involves the application of some 
form of curve fitting by which the constants of the equation can 
be determined. 

The student who is interested in a complete discussion and 
treatment of graphic methods is referred to books on the subject 
by Brinton and by Arkin and Colton.* 

* Brinton, W. C., Graphic presentation. New York: Brinton Associates, 1939;
Arkin, Herbert, and Colton, R. R., Graphs, how to make and use them. New
York: Harper, 1936.



CHAPTER 3


Describing Frequency Distributions 


It has been implied in Chapter 2 that a variable, such as height,
IQ, or reading ability, can be represented by X, where X takes on
various values, i.e., varies from individual to individual. Obviously,
X is not used here to represent an unknown but rather as
a symbol for any of several known quantities. It has also been
implied that the frequency with which a certain X value, or a
certain group of X values, occurs can be represented on the vertical
scale as an ordinate or Y value. Thus, when a frequency
polygon is drawn, it represents the relationship between a variable
X and a frequency Y, where Y is the number of individual scores
falling at a particular X. This relationship is frequently found
to be a curve which has a peak or maximum near the center of the
X's and drops off gradually toward the base line or x axis on either
side of the point of maximum value. In other words, a typical
frequency curve (or polygon) or a frequency distribution can be
roughly characterized as one which shows four chief features: a
clustering of individuals toward some central value, dispersion
about this value, symmetry or lack of symmetry, and flatness or
steepness. Many variables or traits yield distributions which are
said to be approximately bell-shaped, but such a description is
not adequate for scientific purposes. One wishes to know about
what particular value and with how much scatter the individual
scores are distributed, to what extent the distribution is symmetrical,
and to what degree it is peaked or flat. That is, we need
measures of central value or tendency, measures of scatter or dispersion
or variability, and measures of skewness and kurtosis.
With such measures, one can describe the distribution mathematically,
and in such a way that a statistically trained contemporary,
say in Melbourne, can picture to himself the frequency
distribution.

Thus we are led to a consideration of the various measures of
central value, dispersion, skewness, and kurtosis. It is adequate 
and more economical of time to determine these measures from 
frequency distributions rather than from the original undistributed 
scores. Since the computation of the descriptive terms frequently 
involves a determination of the lower limit or midpoint of a class 
interval, the student should recall what has been said about actual 
and expressed class limits. Obviously, if one needs the midpoint 
of an interval, it is necessary only to add one-half the size of the 
interval to the actual lower limit, which must be determined by a 
consideration of the nature of the scores or measures which constitute
the variable. Psychological measurements and test scores
are usually treated as though rounded to the nearest value. 

MEASURES OF CENTRAL VALUE 

The mode. A glance at a typical frequency distribution will 
indicate to us the most frequently occurring X value, or for grouped 
data the group of X values which has the greatest frequency. 
This maximal frequency roughly defines the mode. For nongrouped
data the mode is the X value having the greatest frequency,
whereas for grouped data the mode is taken as the midpoint
of the interval which has the greatest frequency. For a
smoothed frequency curve, the mode is the X value at which the
curve reaches its maximum height. The mode is one indicator of
central value, but as a descriptive statistic it has serious limitations.
If one uses a different size interval, the mode may be decidedly
different. Furthermore, it occasionally happens that two
nonadjacent intervals have the same maximal frequency, thereby 
yielding two modal values. Such a distribution is said to be 
bimodal, but it should be noted that the bimodality may not be 
real but merely accidental, the resultant of the particular grouping 
interval which has been chosen. In dealing with certain discrete 
series, like size of family, the modal value is apt to be more typical 
than some other measure of central value and therefore should be 
used, even though as a measure it is subject to greater sampling 
fluctuations than either the mean or the median. (The question 
of sampling cannot be discussed at this time; the student is asked
to take on faith statements regarding the efficiency of a given
statistic.) 
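
The mode for grouped data may be illustrated with the frequencies of Table 1. In this Python sketch the function name is hypothetical, the 150-159 interval is assumed empty, and the IQ's are assumed to be rounded to the nearest whole number, so that the actual lower limit lies one-half unit below the expressed one. A list is returned so that a tie between intervals (possible bimodality) is not hidden.

```python
def modal_midpoints(lower_limits, freqs, width):
    """Midpoints of the interval or intervals with the greatest frequency.

    `lower_limits` are expressed lower limits of scores rounded to the
    nearest unit, so the actual lower limit is 0.5 below each of them.
    """
    top = max(freqs)
    return [lo - 0.5 + width / 2
            for lo, f in zip(lower_limits, freqs) if f == top]

lower_limits = list(range(40, 170, 10))           # 40, 50, ..., 160
f = [1, 1, 1, 2, 8, 32, 35, 39, 29, 9, 3, 0, 1]   # Table 1, lowest first

print(modal_midpoints(lower_limits, f, 10))
```

Here the interval 110-119 has the greatest frequency, 39, and its midpoint is 114.5; regrouping the same scores with a different interval could shift this value.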

The median. As a measure of central value, the median is defined
in two ways: (1) if the individual scores are arranged in order
with respect to some trait, the median is the value of the midmost 
individual if N is odd, or lies midway between the two middle 
individuals when N is even; (2) when a distribution has been 
made, the median is defined as the point on the scale such that 
the frequency above or below the point is 50 per cent of the total 
frequency. For grouped data, the median may be determined by 
the following steps: 

1. Find one-half of N.

2. Count the frequencies in a cumulative manner from the bottom
up to that interval, say the sth, the frequency of which if
included would give more than, if not included less than, N/2
cases. Obviously the median will fall somewhere in this interval
unless exactly half the values fall below the lower limit of an
interval, in which case this lower limit is the median. Let F_c
equal the total frequency up to the sth interval, and let F_s equal
the frequency in the sth interval.

3. (N/2 - F_c)/F_s will be the proportional distance required in
the sth interval to locate the median.

4. Letting i equal the size of the interval and LL_s the lower
limit of the sth interval, the median will be given by

\[ \mathrm{Mdn} = LL_s + i \, \frac{N/2 - F_c}{F_s} \tag{1} \]

This involves the defensible assumption that the scores for the 
cases falling in the sth interval are distributed fairly evenly over 
the possible score values in the interval. 
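The four steps above lend themselves to a short routine. The following is a minimal sketch in Python, offered only as an illustration; the function name grouped_median and its arguments are assumptions of the sketch and do not appear in the text. The frequencies are those of the spool-packing distribution of Table 2.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data by interpolation within the sth interval.

    lower_limits -- exact lower limit of each interval, lowest first
    freqs        -- frequency in each interval, same order
    width        -- size of the grouping interval (i)
    """
    half = sum(freqs) / 2            # step 1: one-half of N
    below = 0                        # cumulative frequency below the interval
    for lower, f in zip(lower_limits, freqs):
        # step 2: stop at the first occupied interval that reaches N/2
        if f > 0 and below + f >= half:
            # steps 3 and 4: proportional distance into the interval
            return lower + width * (half - below) / f
        below += f
    raise ValueError("the distribution contains no cases")


# Spool-packing scores, lowest interval (210-219) first
lower_limits = [209.5 + 10 * k for k in range(11)]
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

median = grouped_median(lower_limits, freqs, 10)
print(round(median, 2))  # 260.33
```

When exactly half the cases fall below the limit of an interval, the sketch returns the upper limit of the interval just completed, which is the lower limit of the next one when no empty interval intervenes.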

The calculation of the median is illustrated in Table 2, in which
is given the distribution of scores made by 50 college men on the
Brown spool packer. The score is the number of spools packed in
four 1-minute trials.

The chief merits of the median are its ease of computation, its 
independence of extremes (it can be determined even if a known 
number of extremes have not been measured), and the fact that 
it is not affected by the size of extremes. This last point will be 
clearer after a discussion of the mean. 



Table 2. The Calculation of the Median

 Score       f
310-319      1
300-309      2
290-299      4
280-289      1
270-279      6
260-269     12
250-259     11
240-249      8
230-239      2
220-229      0
210-219      3
            --
            50

N/2 = 25
The sth interval is 260-269
Fc = 24, Fs = 12
i = 10
LLs = 259.5
Mdn = 259.5 + 10(25 - 24)/12 = 260.33


The mean. This arithmetic average will already be familiar to
most readers. The mean is defined simply as the sum of all the
scores or measures divided by their number, or

\[ M = \frac{\sum X}{N} \tag{2} \]

where X represents any score, the symbol Σ means "the sum of,"
and N is the total number of cases. When N is small, this definition
form can be used to compute the mean, but when N is large,
say 50, 100, or more, such a method is not economical of time.
Ordinarily, when N is large, one makes a frequency distribution
from which it is possible to compute the mean and median and
other statistical measures. Assuming that the midpoint of an
interval is typical of all the individuals in the interval, one can
obtain the mean by summing the products of the several interval
midpoints by their respective frequencies and dividing this sum
by N. The error introduced by the use of midpoints is nonsystematic,
i.e., tends to be ironed out so far as the computed mean
is concerned.

The calculation of the mean can be shortened still further by
the use of a guessed average from which deviations are taken in
terms of step or class interval units. Thus, if we guess that the
mean will fall at the midpoint of the rth interval, the (r + 1)th
interval deviates one step interval from the guessed mean, and
so on. With this in mind the mean can be calculated from

\[ M = GA + i\,\frac{\sum d}{N} \tag{3} \]

where GA is the guessed average (corresponding to the midpoint
of an interval), d is the variable deviation in step intervals of the
midpoints of other intervals from GA, and i is the size of the grouping
interval. Obviously d may be positive or negative, and the
term Σd is really equivalent to Σfd, which indicates explicitly that
the d value for any interval is to be summed as many times as
the frequency (f) in the interval. Table 3 indicates the calculation
of the mean from grouped data by the use of a guessed average
and deviations therefrom in terms of step intervals.
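The guessed-average method can be checked numerically. The sketch below is illustrative only; the name mean_from_guess is an assumption of the sketch, and the interval midpoints (214.5, 224.5, and so on) are taken from the spool-packing distribution. It shows that the computed mean does not depend on which midpoint is guessed.

```python
def mean_from_guess(midpoints, freqs, width, guess):
    """Mean of grouped data from step deviations about a guessed average."""
    n = sum(freqs)
    # d for each interval is (midpoint - guess) / width; it is summed f times
    sum_fd = sum(f * (m - guess) / width for m, f in zip(midpoints, freqs))
    return guess + width * sum_fd / n


midpoints = [214.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

m_center = mean_from_guess(midpoints, freqs, 10, 264.5)  # guess near the center
m_bottom = mean_from_guess(midpoints, freqs, 10, 214.5)  # guess at lowest midpoint
print(m_center, m_bottom)  # 261.5 261.5
```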




Table 3. Calculation of the Mean

Score     f     d     fd
 310      1    +5     +5
 300      2    +4     +8
 290      4    +3    +12
 280      1    +2     +2
 270      6    +1     +6
 260     12     0      0
 250     11    -1    -11
 240      8    -2    -16
 230      2    -3     -6
 220      0    -4      0
 210      3    -5    -15
         --
         50

Sum of the positive fd = +33; sum of the negative fd = -48
Σd = 33 + (-48) = -15
i(Σd/N) = 10(-15/50) = -3.00
M = 264.5 - 3.00 = 261.50



The same value could be obtained by selecting an arbitrary
origin (AO) at the midpoint of the lowest interval, 210-219,
taking deviations therefrom (all will be positive), and substituting
in

\[ M = AO + i\,\frac{\sum d}{N} \tag{3a} \]

The advantage of this last scheme when a calculating machine
(Monroe or Marchant or Friden type) is available will be obvious
to those familiar with such machines. The reasonableness of
using deviations from an arbitrary origin may be made clear by
the following. Suppose we wish to determine the mean height of
a group of individuals. We can measure each person's height
from the floor, or as so much in excess of a stationary bar 5 feet
from the floor. The sum of the excesses divided by N will be the
mean excess, and obviously to this we must add 5 feet to obtain
the mean height of the group.

Formula (3a) can readily be derived. Each score or value of X
can be expressed as

\[ X = AO + id \]

in which AO and i are constant and d varies. From the definition
formula for the mean we have

\[ M = \frac{\sum X}{N} = \frac{\sum (AO + id)}{N} = \frac{\sum (AO) + \sum id}{N} \]

Σ(AO) will equal N(AO) because summing a constant N times is
the same as multiplying it by N. As an exercise, the student
should demonstrate by taking varying numbers, each multiplied
by a constant, that Σid = iΣd, i.e., a constant can be taken out
of or from under the summation sign. Hence we have

\[ M = \frac{N(AO)}{N} + \frac{i \sum d}{N} = AO + i\,\frac{\sum d}{N} \]


The beginning student who is puzzled about which measure to
use, the median or the mean, should remember that the purpose
of measures of central value is description. When one is attempting
to reduce a mass of scores or a distribution of measures to a
few descriptive constants, the mean and median are both descriptive
terms which more or less adequately depict the "average"
or typical score, and the choice between the two is frequently
determined on the basis of which is more typical. Thus, if six
men run 100 yards in 9.6, 9.7, 9.8, 9.9, 10.0, and 14.0 seconds,
the mean value of 10.5 is not as typical as the median value of
9.85. In general, the mean is not as typical as the median when
there are extreme measures in one direction. However, when the
scores are distributed in an approximately symmetrical fashion,
the mean and median will be equal or nearly so, and either will
be as typical as the other. The mean in this case has two distinct
advantages over the median: (1) It is usually a more stable measure
in the sampling sense, i.e., if we regard our scores as based on a
sample of N individuals and then take another sample, the means
of the two samples will in general show closer agreement than the
two medians. This point will be discussed in more detail in the
chapter on sampling errors. (2) It can be handled arithmetically
and algebraically. The student should prove that, if the mean of
N1 cases is M1, and of N2 cases is M2, the mean of the two groups
combined will be given by

\[ M_t = \frac{N_1 M_1 + N_2 M_2}{N_1 + N_2} \]

The median cannot be handled in such a fashion. Furthermore,
the mean is used in connection with more advanced topics in
statistics, whereas the median is seldom mentioned. Thus, unless
the distribution is markedly skewed, the mean should be used.
The problem of describing skewness will receive consideration
after measures of variation have been discussed.
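Both points of the preceding paragraph can be verified with a few lines of Python. This is a sketch for illustration only: the split of the six running times into two groups is arbitrary, and the name combined_mean is an assumption of the sketch.

```python
from statistics import mean, median

times = [9.6, 9.7, 9.8, 9.9, 10.0, 14.0]   # seconds for 100 yards

# The single extreme time pulls the mean but not the median.
print(round(mean(times), 2), round(median(times), 2))  # 10.5 9.85


def combined_mean(n1, m1, n2, m2):
    """Mean of two groups combined, from their sizes and separate means."""
    return (n1 * m1 + n2 * m2) / (n1 + n2)


# Arbitrary split into a group of four and a group of two
group1, group2 = times[:4], times[4:]
pooled = combined_mean(len(group1), mean(group1), len(group2), mean(group2))
```

The pooled value agrees with the mean of all six times; no comparable rule recovers the median of the whole set from the two separate medians.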

As exercises, the student should show algebraically or to his own
satisfaction by numerical examples that (1) if a constant is added
to or subtracted from the scores of a group, the new mean will be
M + C or M - C, where C is the given constant and M the mean
of the original scores; (2) if all the scores are multiplied by a constant,
C, the new mean will be CM, whereas dividing by a constant
will lead to M/C as the new mean.


MEASURES OF VARIATION 

The description of the extent of scatter (or cluster) about the
central value may be obtained by any one of several measures.
These measures differ somewhat in interpretation and usefulness.
One may doubt whether the range (highest to lowest score) is of
sufficient value in psychological research to justify its use as a
measure of variation. It is, obviously, determined by the location
of just two individual measures or scores and consequently tells
us nothing about the general clustering of the scores about a central
value.

Quartile deviation. An easily computed description of dispersion
is the quartile deviation (Q), defined as (Q3 - Q1)/2, in
which Q3 (or the third quartile) is the point above which one-fourth
of the cases fall and Q1 (or the first quartile) is the point with
three-fourths of the cases above. Q2 (or the median) has already
been defined as the point above which one-half of the cases fall.
The computation of the two quartiles Q3 and Q1 from grouped
data is essentially the same as that of the median. For instance,
in determining the third quartile we count up to the interval in
which the point falls which divides the number of cases into two
parts: three-fourths below and one-fourth above. The distance
into this interval is found in exactly the same manner as in computing
the median. Since the quartiles are not influenced by extremes,
it is customary to use them along with the median. By
definition, 50 per cent of the cases fall between the first and third
quartiles, but in nonsymmetrical distributions it is not likely that
the limits indicated by the median plus and minus Q will include
50 per cent. It would seem better to report both the first and
third quartiles, instead of Q, as these values along with the median
will enable one to picture whether or not the clustering above the
median is different from that below the median.
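Since the quartiles are located exactly as the median is, one routine serves for all three. The sketch below is illustrative; the name grouped_point and the resulting quartile values are computed here from the spool-packing frequencies and are not figures given in the text.

```python
def grouped_point(lower_limits, freqs, width, proportion):
    """Score below which the given proportion of the cases fall."""
    target = proportion * sum(freqs)
    below = 0                        # cumulative frequency below the interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= target:
            # proportional distance into the interval, as for the median
            return lower + width * (target - below) / f
        below += f
    raise ValueError("the distribution contains no cases")


lower_limits = [209.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

q1 = grouped_point(lower_limits, freqs, 10, 0.25)
q3 = grouped_point(lower_limits, freqs, 10, 0.75)
q = (q3 - q1) / 2
print(q1, q3, q)  # 248.875 272.0 11.5625
```

Here the median (about 260.33) lies nearer to Q1 than to Q3, so the two quartiles reported separately say more about the shape of the distribution than Q alone.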

Percentiles. Closely allied to the quartiles are the percentiles.
The Pth percentile is defined as a point below which P per cent of
the cases fall. Thus the median is the 50th, the third quartile
the 75th, and the first quartile the 25th percentile. The 10th,
20th, ... 90th percentiles are sometimes called deciles. The computation
of the percentiles from grouped data is accomplished in
the manner indicated for computing the quartiles. The location
of the zeroth and 100th percentiles is always perplexing. Since
these two points are dependent upon the location of just two scores
(i.e., are greatly influenced by chance), they are difficult to interpret.
Common sense would suggest that the concept of these two
percentiles be dropped.

The use of percentiles, or the difference between percentiles, as
an indication of dispersion should be obvious. In fact, the 10th-90th
percentile range is a somewhat better (more stable from
sample to sample) measure of dispersion than the quartile deviation.
Percentiles, however, are chiefly of value in reporting the
scores of individuals on psychological and educational tests. Ordinarily
a raw score gives no inkling of what it means, whereas when
it is said that an individual scores at or near the 85th percentile,
the implication is that 15 per cent of his fellows score higher or
better than he. Thus a percentile score carries with it some idea
of the location of the individual with reference to the group.
Furthermore, percentile scores for entirely different tests are comparable
if derived from the same group or sample. The original
raw scores might be different units, e.g., number of additions per
minute and time to read a page of prose, and consequently not at
all comparable.
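Reporting an individual's standing requires the reverse operation: given a score, find the per cent of cases below it. The following sketch assumes, as in the computation of the median, that cases are spread evenly within each interval; the name percentile_rank is an assumption of the sketch.

```python
def percentile_rank(score, lower_limits, freqs, width):
    """Per cent of cases falling below a score in a grouped distribution."""
    n = sum(freqs)
    below = 0.0
    for lower, f in zip(lower_limits, freqs):
        if score >= lower + width:
            below += f                             # whole interval is below
        elif score > lower:
            below += f * (score - lower) / width   # part of the interval
    return 100.0 * below / n


lower_limits = [209.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

rank = percentile_rank(272.0, lower_limits, freqs, 10)
print(rank)  # 75.0
```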

The average deviation. Sometimes called the mean deviation
or mean variation, the average deviation (AD) is defined as the
average of the deviations of the several scores from the mean.
Thus, if x = X - M, then AD = Σ|x|/N, where |x| is the
absolute value of x, i.e., the negative deviations are treated as
though positive. When N is small, the average deviation can
readily be computed from its definition, but for a large number of
cases this would be too cumbersome. Its value for grouped data
may be obtained by

\[ AD = \left( FM - i \sum fd - F \cdot AO \right) \frac{2}{N} \tag{4} \]

where M = mean.
      F = total number of cases in intervals with midpoints
          below the mean.
      i = interval size.
      AO = arbitrary origin taken as the midpoint of lowest
          interval.
      Σfd = sum of deviations of the F cases from AO in step units.


Table 4. Computation of the AD

Score     f     d
 310      1
 300      2
 290      4
 280      1
 270      6
 260     12
 250     11     4
 240      8     3
 230      2     2
 220      0     1
 210      3     0
         --
         50

M = 261.50
AO = 214.5
F = 24
i = 10
Σfd = 72
2/N = .04
AD = (24(261.50) - 10(72) - 24(214.5))(.04) = 16.32


This formulation, which is particularly well adapted for use with 
a calculating machine, is not unwieldy for longhand computation. 
The use of formula (4) is demonstrated in Table 4. 
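Formula (4) rests on the fact that deviations below the mean and deviations above it have the same total, so that only the F cases below the mean need be handled. A sketch of the computation, with the definition applied to the midpoints as a check, follows; the function names are assumptions of the sketch.

```python
def grouped_ad(midpoints, freqs, width):
    """Average deviation of grouped data by formula (4)."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    origin = midpoints[0]                    # AO: midpoint of lowest interval
    below = [(m, f) for m, f in zip(midpoints, freqs) if m < mean]
    big_f = sum(f for _, f in below)         # F: cases with midpoints below M
    sum_fd = sum(f * (m - origin) / width for m, f in below)
    return (big_f * mean - width * sum_fd - big_f * origin) * 2 / n


def ad_by_definition(midpoints, freqs):
    """Average of the absolute deviations of the midpoints from the mean."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    return sum(f * abs(m - mean) for m, f in zip(midpoints, freqs)) / n


midpoints = [214.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

ad = grouped_ad(midpoints, freqs, 10)
print(round(ad, 2))  # 16.32
```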



Contrasted with the quartile deviation, the average deviation
gives weight to extremes, and for the usual bell-shaped distribution
the limits M plus and minus AD will include about 57.5
per cent of the cases; the average deviation is larger than Q but
not so large as the standard deviation, to which we now turn.

The standard deviation. A third measure of variation, the
standard deviation (SD or σ), is defined as

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \tag{5} \]

where x = X - M. To compute the standard deviation directly
from this formula would be very cumbersome and uneconomical,
since x will usually involve decimals. A computational formula
involving deviations from an arbitrary origin (AO) can be easily
derived by algebra. Such a derivation is included here in order
further to familiarize the student with the method of handling
summation signs. The derivation will be carried through for σ²,
technically known as the variance; then at the end we can take the
square root to obtain σ.

From formula (5) we have

\[ \sigma^2 = \frac{\sum x^2}{N} \]

in which x = X - M.

As in deriving formula (3a), we can set

\[ X = AO + id \]

and since M = AO + i(Σd/N), we have, substituting in x = X - M,

\[ x = AO + id - \left( AO + i\,\frac{\sum d}{N} \right) = id - ic \]

where for convenience we let c stand for Σd/N.

\[ x^2 = (id - ic)^2 = i^2 (d - c)^2 \]

\[ \sum x^2 = i^2 \sum (d - c)^2 = i^2 \left( \sum d^2 - 2c \sum d + Nc^2 \right) \]

Dividing both sides by N, we have, since Σd/N = c,

\[ \sigma^2 = \frac{\sum x^2}{N} = i^2 \left( \frac{\sum d^2}{N} - 2c^2 + c^2 \right) = i^2 \left[ \frac{\sum d^2}{N} - \left( \frac{\sum d}{N} \right)^2 \right] \]

hence

\[ \sigma = \frac{i}{N} \sqrt{N \sum d^2 - \left( \sum d \right)^2} \tag{6} \]

where Σd = the algebraic sum of the deviations (in step units)
from an arbitrary origin.

Σd² = the sum of the squares of the deviations (in step units).

The arbitrary origin may be taken as the midpoint of the lowest
interval or as a guessed average near the center of the distribution.
The advantage of the latter procedure is that the d's will be relatively
small and consequently will not lead to the handling of large
numbers, whereas the first procedure avoids the use of negative
numbers and is more readily adaptable to machine computation.
Both methods will be illustrated.

Table 5. Calculation of SD by Use of a Guessed Average

Score     f     d     fd    fd²
 310      1    +5     +5     25
 300      2    +4     +8     32
 290      4    +3    +12     36
 280      1    +2     +2      4
 270      6    +1     +6      6
 260     12     0      0      0
 250     11    -1    -11     11
 240      8    -2    -16     32
 230      2    -3     -6     18
 220      0    -4      0      0
 210      3    -5    -15     75
         --          ---    ---
         50          -15    239

Σd = -15
Σd² = 239
σ = (10/50)√(50(239) - (-15)²) = 21.66

It will be noted that Table 5 is the same as Table 3 except that
we now have an fd² column, and of course the substitution of -15
into formula (3) will again yield the mean. It will also be seen
that the fd² values can be obtained by multiplying the d values
by the corresponding fd's.

Table 6 gives the detailed steps for computing the standard
deviation by taking deviations from an arbitrary origin located at
the bottom of the distribution. The student will find that
Σd = 235 yields the mean when substituted in formula (3a), and
that the standard deviation obtained in Table 6 is the same as
that obtained in Table 5.

Table 6. Computation of M and SD by Use of an Arbitrary Origin
at the Bottom of the Distribution

Score     f     d     fd    fd²
 310      1    10     10    100
 300      2     9     18    162
 290      4     8     32    256
 280      1     7      7     49
 270      6     6     36    216
 260     12     5     60    300
 250     11     4     44    176
 240      8     3     24     72
 230      2     2      4      8
 220      0     1      0      0
 210      3     0      0      0
         --          ---   ----
         50          235   1339

Σd = 235
Σd² = 1339
σ = 21.66 by formula (6)
M = 261.50 by formula (3a)
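That the two choices of origin lead to the same standard deviation can be confirmed by machine of a more recent kind. The sketch below applies formula (6) to the spool-packing distribution with the origin first at a guessed average and then at the bottom; the name grouped_sd is an assumption of the sketch.

```python
from math import sqrt


def grouped_sd(midpoints, freqs, width, origin):
    """Standard deviation of grouped data by formula (6)."""
    n = sum(freqs)
    steps = [(m - origin) / width for m in midpoints]      # d for each interval
    sum_d = sum(f * d for f, d in zip(freqs, steps))       # each d summed f times
    sum_d2 = sum(f * d * d for f, d in zip(freqs, steps))
    return width / n * sqrt(n * sum_d2 - sum_d ** 2)


midpoints = [214.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]

sd_guess = grouped_sd(midpoints, freqs, 10, 264.5)    # guessed average
sd_bottom = grouped_sd(midpoints, freqs, 10, 214.5)   # bottom of distribution
print(round(sd_guess, 2), round(sd_bottom, 2))  # 21.66 21.66
```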


The fd and fd² columns need not appear on the work sheet when
we are computing the mean and standard deviation by a Monroe
or Marchant or Friden type calculating machine. The two required
sums can be obtained by punching in the lowest d in the
right-hand part of the keyboard and the corresponding d² just left
of the center of the keyboard, multiplying both simultaneously by
the given frequency, and then, without clearing the lower dial,
punching in the next larger d and its square, and so on. The successive
products so obtained will be accumulated by the machine
so that Σd is read directly from the right-hand side of the lower
dial, and Σd² is read from near the center of the same dial. If
either an eight- or ten-bank machine is used, the d's of 9 and less
are punched in the right-hand column of the keyboard, and higher
values will of course require the first two columns. The squares
of the d's will ordinarily be less than 400, rarely greater than 961,
so that their values can be punched in columns 6, 7, and 8. The
student should note that the squares of 1, 2, and 3 are to be punched
in column 6, the squares of 4 to 9 in columns 6 and 7, and the
squares of 10 to 31 in columns 6, 7, and 8. The sum of the squares
will appear in the lower dial from window 6 to the left. With a
little practice the two required sums for a distribution of 15 intervals
and 200 cases can be obtained in less than a minute. It
should not be necessary to say that the computation should be
done twice as a check.

For use with a calculator, formula (6) has an advantage over
formulas which involve two divisions under the radical. Thus
we place the sum of the squares in the right-hand side of the keyboard,
multiply by N, and, leaving the product in the lower dial,
punch the sum of the d's in the keyboard and subtract it Σd times,
and then from the dial copy the value of NΣd² - (Σd)².

Briefly summarizing, it will be noted that (1) with a machine
Σd and Σd² taken from an arbitrary origin at the bottom of the
distribution are no more difficult to compute than when taken
from a guessed average, (2) all sums are positive, and (3) the two
sums necessary for determining both the mean and standard deviation
can be obtained in the same operation. It is helpful to write
the d column in red on the work sheet, thereby throwing it into
contrast with the f column.

It has already been stated that the use of a midpoint as a representative
score for all the individuals in an interval does not lead
to a systematic error in the computed mean, but this procedure
tends to give a standard deviation which is slightly in excess of
the value which would be obtained by using formula (5). A
standard deviation computed from grouped data can be corrected
by substituting in

\[ \sigma_c = \sqrt{\sigma^2 - \frac{i^2}{12}} \tag{7} \]

The i²/12 is known as Sheppard's correction for grouping. The
uncorrected and corrected values differ but little when 12 or 15
intervals have been used, and as the number of intervals is increased,
the difference becomes smaller and smaller. If less than
10 intervals have been used, the error may be appreciable and the
correction should be applied. These considerations form the basis
for the suggested rule that at least 10 or 12, and not more than 20,
intervals be used.
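The size of the correction for the spool-packing distribution (11 intervals of size 10) may be of interest. The sketch below assumes the usual form of Sheppard's correction, in which i²/12 is subtracted from the variance computed from grouped data; the function name is an assumption of the sketch.

```python
from math import sqrt


def sheppard_corrected_sd(sd, width):
    """Grouped-data SD after subtracting i^2/12 from the variance."""
    return sqrt(sd ** 2 - width ** 2 / 12)


# Uncorrected value by formula (6), using the sums of Table 5
uncorrected = 10 / 50 * sqrt(50 * 239 - (-15) ** 2)
corrected = sheppard_corrected_sd(uncorrected, 10)
print(round(uncorrected, 2), round(corrected, 2))  # 21.66 21.46
```

The reduction is about one per cent here, in keeping with the statement that the correction matters little once 10 or 12 intervals are used.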

It should be noted that formula (6) can be altered so that σ may
be computed directly from the gross scores without first making a
distribution. Thus any score or measure, X, can be thought of as
a deviation from the zero point on the scale, and with no grouping,
i = unity, so that (6) can be written as

\[ \sigma = \frac{1}{N} \sqrt{N \sum X^2 - \left( \sum X \right)^2} \tag{6a} \]

This form is useful when N is small and the X's are not too large
numerically. All the scores are simply squared and then summed
for ΣX², and ΣX has the same meaning as in formula (2).
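A sketch of the gross-score computation follows. The five scores are hypothetical, chosen only to keep the arithmetic small, and the result is compared with the pstdev function of Python's standard statistics module, which likewise divides by N.

```python
from math import sqrt
from statistics import pstdev


def sd_from_gross_scores(scores):
    """Standard deviation from gross scores: sqrt(N*sum(X^2) - (sum X)^2) / N."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)   # each score squared, then summed
    return sqrt(n * sum_x2 - sum_x ** 2) / n


scores = [12, 15, 11, 18, 14]             # hypothetical scores
sd = sd_from_gross_scores(scores)
print(round(sd, 3))  # 2.449
```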

Regarding the interpretation of the standard deviation, it can
be said that, when we have the usual symmetrical bell-shaped
distribution, about 68 per cent of the cases will fall between the
limits plus and minus 1σ from the mean, about 95 per cent between
plus and minus 2σ, and nearly all the cases (99.73 per cent) between
plus and minus 3σ. The standard deviation, even more than
the average deviation, gives weight to extremes and therefore may
not be as good as the quartiles for describing the dispersion. The
standard deviation has decided advantages over other measures
of dispersion: (1) Typically, it is more stable from the sampling
viewpoint. (2) It can be handled algebraically, i.e., if we have two
groups of N1 and N2 cases, with M1 and M2, and σ1 and σ2, as the
respective means and standard deviations, we can obtain the
standard deviation for the two groups combined by

\[ \sigma_t = \sqrt{\frac{N_1 (M_1^2 + \sigma_1^2) + N_2 (M_2^2 + \sigma_2^2)}{N_1 + N_2} - M_t^2} \tag{8} \]

where the subscript t refers to the combined or total group. The
mean for the combined group can be obtained by a formula given
on p. 18. Formula (8) can be extended for determining the standard
deviation for three or more groups combined. The student
can make this extension as an exercise. (3) The standard deviation
is a mathematical term which has considerable importance in
more advanced statistical work. It is usually involved in the
determination of sampling errors and is the measure of variation
used in the analysis of variation and in connection with correlational
analysis. Therefore, unless there are definite reasons for
not using it, the standard deviation, instead of the average deviation
or Q, should be used as a description of the amount of dispersion.
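Formula (8) can be tried on two small groups and checked against the standard deviation of the pooled scores. The scores below are hypothetical and the name combined_sd is an assumption of the sketch; pstdev from the standard statistics module divides by N, as formula (5) does.

```python
from math import sqrt
from statistics import mean, pstdev


def combined_sd(n1, m1, s1, n2, m2, s2):
    """SD of two groups combined, from their sizes, means, and SDs."""
    m_total = (n1 * m1 + n2 * m2) / (n1 + n2)     # combined mean
    mean_square = (n1 * (m1 ** 2 + s1 ** 2) + n2 * (m2 ** 2 + s2 ** 2)) / (n1 + n2)
    return sqrt(mean_square - m_total ** 2)


group1 = [12, 15, 11, 18, 14]    # hypothetical scores
group2 = [20, 22, 19, 25]

sd_total = combined_sd(len(group1), mean(group1), pstdev(group1),
                       len(group2), mean(group2), pstdev(group2))
sd_pooled = pstdev(group1 + group2)   # computed directly from all nine scores
```

The two values agree to within rounding error, since N(M² + σ²) is simply the sum of the squared scores of a group.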

As an exercise, show that, if a constant is added to or subtracted
from each of a set of scores, the standard deviation does not change,
and that multiplying or dividing each by a constant will lead to
Cσ or σ/C, respectively, as the new standard deviation, where σ
stands for the sigma of the original scores and C is the constant.


MEASURES OF SKEWNESS AND KURTOSIS 

If a distribution is not of the symmetrical bell-shaped type, it is
not sufficient for descriptive purposes to report only the mean and
standard deviation. We also need a measure of the lack of symmetry,
i.e., of skewness, and frequently it is desirable to describe
the distribution still further by giving a measure which indicates
whether the distribution is relatively peaked or flat-topped, i.e., a
measure of kurtosis.

Skewness can be described roughly by a number of measures,
such as the difference between the mean and median divided by
the standard deviation, or in terms of quartiles or percentiles. If
an adequate and stable description of skewness is desired and if a
measure of kurtosis is also needed, a method based on moments is
to be preferred.

The first four moments about the mean are defined as follows:

\[
\begin{aligned}
u_1 &= \frac{\sum x}{N} = 0 \\
u_2 &= \frac{\sum x^2}{N} \\
u_3 &= \frac{\sum x^3}{N} \\
u_4 &= \frac{\sum x^4}{N}
\end{aligned}
\tag{9}
\]

where x represents the deviation of each score from the mean of all
the scores. For purposes of computation, the moments about an
arbitrary origin can be determined, and then from these values we
can obtain the moments about the mean. This procedure has
already been employed in computing the standard deviation; i.e.,
we took deviations from an arbitrary origin. [The definition of
the standard deviation, formula (5), was in terms of deviations
from the mean.] If we use v to represent moments about an arbitrary
origin, the first four moments about AO can be defined as
follows, where d is the score deviation from AO in step units:

\[
\begin{aligned}
v_1 &= \frac{\sum d}{N} \\
v_2 &= \frac{\sum d^2}{N} \\
v_3 &= \frac{\sum d^3}{N} \\
v_4 &= \frac{\sum d^4}{N}
\end{aligned}
\tag{10}
\]

When the v's have been calculated, the u's can be readily determined
from the following relationships:

\[
\begin{aligned}
u_1 &= 0 \\
u_2 &= i^2 (v_2 - v_1^2) = \sigma^2 \\
u_3 &= i^3 (v_3 - 3 v_1 v_2 + 2 v_1^3) \\
u_4 &= i^4 (v_4 - 4 v_1 v_3 + 6 v_1^2 v_2 - 3 v_1^4)
\end{aligned}
\tag{11}
\]

The student should note the similarity of the formula in (11) for
the second moment to that given for the standard deviation
[formula (6)].

A measure of skewness defined in terms of moments is

\[ g_1 = \sqrt{\beta_1} = \frac{u_3}{u_2 \sqrt{u_2}} \tag{12} \]

For symmetrical distributions the value of g1 will be zero; hence
the departure of g1 from zero can be taken as a measure of skewness.
The deviation of g1 from zero, however, must be considered
in light of the operation of chance or in terms of sampling errors
(to be discussed later). The skewness is said to be positive when
g1 is positive and negative when g1 is negative.

The degree of kurtosis can be described by

\[ g_2 = (\beta_2 - 3) = \frac{u_4}{u_2^2} - 3 \tag{13} \]

When g2 is less than zero, the distribution tends to be flat-topped,
whereas for g2 greater than zero it is relatively steep or peaked.
When both g1 and g2 are zero or near zero, the distribution is of the
usual symmetrical bell-shaped type, which is referred to as the
"normal" frequency distribution.
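The passage from moments about an arbitrary origin to moments about the mean, and thence to g1 and g2, may be sketched as follows. The relations used for the third and fourth moments are the standard ones for converting moments about an origin into moments about the mean; the variable names and the choice of the spool-packing distribution as data are assumptions of the sketch, and the g values it yields are not figures given in the text.

```python
from math import sqrt, isclose

midpoints = [214.5 + 10 * k for k in range(11)]   # lowest interval first
freqs = [3, 0, 2, 8, 11, 12, 6, 1, 4, 2, 1]
width, origin = 10, 214.5                          # i and AO
n = sum(freqs)


def moment_about_origin(k):
    """kth moment of the step deviations d about the arbitrary origin."""
    return sum(f * ((m - origin) / width) ** k
               for m, f in zip(midpoints, freqs)) / n


v1, v2, v3, v4 = (moment_about_origin(k) for k in (1, 2, 3, 4))

# Moments about the mean, obtained from the moments about the origin
u2 = width ** 2 * (v2 - v1 ** 2)
u3 = width ** 3 * (v3 - 3 * v1 * v2 + 2 * v1 ** 3)
u4 = width ** 4 * (v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4)

g1 = u3 / (u2 * sqrt(u2))     # skewness
g2 = u4 / u2 ** 2 - 3         # kurtosis


def moment_about_mean(k):
    """kth moment computed directly from deviations about the mean."""
    mean = origin + width * v1
    return sum(f * (m - mean) ** k for m, f in zip(midpoints, freqs)) / n
```

For these data u2 comes to 469, the square of the standard deviation found earlier, and g1 is small and positive. With only 50 cases neither g value deserves much weight.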

Formulas (12) and (13) also define β1 and β2, which have been
and are still used as measures of skewness and kurtosis. Recently,
the g measures have come into usage because of certain advantages
which need not be discussed here. The reader who compares the
above g's with the similar g's in recent editions of R. A. Fisher's
Statistical methods for research workers * will note that they differ
slightly, but these differences are negligible when N is as large as
100.

It will be noted that the measure of skewness involves taking
the third moment relative to σ³ (since u2 = σ²), and that the
measure of kurtosis depends upon the fourth moment relative to
σ⁴. For a given distribution, all the values of u2, u3, and u4 are
in terms of the same measurement unit, say inches or pounds or
IQ's or minutes; hence the ratios in formulas (12) and (13) are
pure numbers, i.e., are not inches or pounds or IQ's or minutes.
If we have the distribution of the weights and of the heights for
1000 individuals, the measure of skewness for the height distribution
may be compared directly with that for the weight distribution.
This is true by virtue of the fact that for each we are expressing
the third moment relative to the amount of variability, both
in inches for one distribution, both in pounds for the other. Likewise,
it can be reasoned that the measures of kurtosis for different
distributions are comparable, although the distributions involve
different measurement units.

In order to help the reader visualize the meaning of different
values for g1 as associated with different degrees of asymmetry,
Fig. 6 has been prepared.

* Fisher, R. A., Statistical methods for research workers, London: Oliver and
Boyd.



Fig. 6. Polygons with different degrees of skewness.

When we have determined the mean and the second, third, and
fourth moments, and from the moments have derived expressions
which tell us the degree of dispersion, skewness, and kurtosis, we
have a description which is adequate for most distributions. These
measures can be used to determine the type of mathematical equation
which will fit an observed frequency polygon; i.e., we can
write the equation of a frequency curve which fits the observed
frequency distribution. A distribution frequently found in psychological
research is of the "normal" type, which is sufficiently
described by the mean and standard deviation. Ordinarily it is
not necessary to compute g1 unless the distribution "appears"
to be skewed or to compute g2 unless the distribution seems peaked
or flat. The nature of the research, the type of variable being
studied, and also the size of the sample are factors which need to
be considered in making a decision as to the necessity for computing
measures of skewness and kurtosis. It is seldom advisable to compute
these measures when N is less than 100.

The student should be apprised of the fact that the rather frequent
occurrence of symmetrical distributions for psychological
variables may result from an artifact, and also that the occurrence
of a skewed distribution may likewise be artifactual. This is true
because very few of the instruments used in psychological
"measurement" involve equal unit scales; the measuring units
are frequently arbitrary or even accidental. Many of the variables
are measured simply in terms of the number of items checked or
the number of items correct. The shape of the resulting distributions
is largely determined by the percentage checking the items
or by the difficulty of the items. If the items are of medium difficulty
for a group, it can be expected that the scale will yield a
symmetrical distribution when applied to the group; if the items
are easy, the scores will pile up towards the top (give negative
skewness); if difficult, a piling up towards the bottom will occur.
In the absence of equal scale units for the measuring devices one
cannot really say whether the distribution of, for example, arithmetic
ability for a given group is symmetrical or skewed; all that
can be said is that in terms of the units used the distribution has a
particular shape.
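The dependence of skewness on item difficulty can be illustrated under a deliberately simple assumption, which is not made in the text: a test of independent items, every one passed with the same probability, so that the number-correct score follows the binomial distribution. The sketch computes g1 for such a score exactly; the function name and the choice of 20 items are hypothetical.

```python
from math import comb, sqrt


def number_correct_skewness(n_items, p_correct):
    """g1 of the number-correct score for independent, equally hard items."""
    probs = [comb(n_items, k) * p_correct ** k * (1 - p_correct) ** (n_items - k)
             for k in range(n_items + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    u2 = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    u3 = sum((k - mean) ** 3 * p for k, p in enumerate(probs))
    return u3 / (u2 * sqrt(u2))


medium = number_correct_skewness(20, 0.5)   # items of medium difficulty
easy = number_correct_skewness(20, 0.9)     # easy items
hard = number_correct_skewness(20, 0.1)     # difficult items
```

Under this assumption the medium items give a g1 of zero, the easy items a negative g1 (scores piled towards the top), and the difficult items a positive g1, although the same group is supposed in every case.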

From the foregoing it would seem that, since skewness (and
kurtosis too) is partly a function of the accidental nature of the
measuring units, the descriptive measures of shape would have
little value in psychology. The fact remains, however, that sometimes
it is desirable to specify the skewness and kurtosis of a distribution
of scores merely as a part of the description of what
happens when a scale of measurement, however arbitrary the
units, is applied to a given group. Furthermore, it is to the student's
advantage to know something of measures of skewness and
kurtosis because we shall later have occasion to refer to them, and
because he is apt to encounter them in more mathematical treatments
of statistics.



CHAPTER 4 


The Normal Curve and Probability 


By successive smoothing of a polygon (or distribution), one can
iron out irregularities until the polygon becomes a "smooth" or
regular and uniform curve. We can think of this curve as being
similar or nearly identical to what we would obtain were we to
increase indefinitely the size of our sample and at the same time
use smaller and smaller grouping intervals. That is, the limit of a
polygon, as we allow N to approach infinity and the interval size
to approach zero, is conceived to be a curve which is smooth and
regular. Now such a uniform curve can usually be described in
terms of a mathematical equation. The student may recall that
the general equation for a straight line is y = ax + b, and that
y = 2x + 3 is the equation for a particular line, that x² + y² = a²
is the equation for a circle of radius a with the origin or intersection
of the abscissa and ordinate at the center, also that
y = a + bx + cx² is the general equation for a parabola. It is
not until we give specific numerical values to the constants that
we have equations for particular curves.

Frequency curves can be thought of as representing the rela¬ 
tionship between two variables: y, or the height of the curve, and 
Xy the variate or variable under consideration. Frequency polygons 
or distributions, even when smoothed, may be of various shapes: 
symmetrical or skewed, flat-topped. or steep, humped near the 
center or at one end, bimodal or unimodal, J-shaped or U-shaped, 
falling off gradually or suddenly, etc. A complete description of a 
frequency distribution is obtained when we have succeeded in 
writing the equation of the curve which “fits’’ the distribution. 
The type of curve to be fitted is chosen on the basis of certain 
criteria which are derived from the moments and the interrelations 
among the moments. The late Professor Karl Pearson developed 
the mathematics of a system of frequency curves and classified 
distributions according to several "types" of curves, but a complete 
exposition of these types is beyond the scope of this text. 
The so-called normal curve is one of his types, and because many 
distributions are approximately of the normal type, and since the 
question of sampling is intimately related to the normal curve, it is 
necessary for us to study in detail the properties of this curve. 

The general equation of the normal curve can be written as 



    y = \frac{N}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X - M)^2}{2\sigma^2}}    (14)


in which y represents the height for any value of the variable X, 
N is the number of cases, σ is the standard deviation, M is the 
mean of the distribution, and π (3.1416) and e (2.7183) are well-known 
mathematical constants. In order to write the equation 
of a particular normal curve, i.e., one which corresponds to a particular 
distribution, we need to know N, M, and σ. This is the 
basis for the fact that, when we have the usual bell-shaped distribution, 
we need only the mean and standard deviation to describe 
it adequately. But in order to say that a given distribution is 
really normal, it is necessary to show that the measures of skewness 
and kurtosis (as defined on pp. 27-28) are zero or approximately zero. 

Referring again to equation (14), we note that the numerator 
part of the exponent could be written in terms of deviation units, 
i.e., with x instead of X − M. The y for a positive deviation of, 
say, 10 will be exactly the same as for a negative 10 for the simple 
reason that the deviation value in the formula is squared. This 
indicates that the normal curve is symmetrical about the mean, 
and hence the mean and median coincide. When x = 0, i.e., when 
we take X = M, y has its maximal value, and therefore the mean 
and mode coincide. For values of x other than zero, the height of 
the curve will be less. This is evident when we consider the fact 
that the exponent in equation (14) is negative. The height of the 
curve as we go in either direction from the mean becomes less and 
less (see Fig. 7). This dropping off is slow at first, then rapid, and 
then slow again. If we take the maximum value of y (i.e., at the 
mean) as unity, the ordinate at the point .5σ from the mean is 
about .883; at 1σ, about .606; at 2σ, .135; and at 3σ, .011. As we 
go still farther from the mean, the value of y becomes smaller, 
and as x approaches infinity, y approaches zero (asymptotic). 





Theoretically, the curve never touches the base line, but so far as 
empirical distributions are concerned, y does become zero. 

For both the frequency polygon and the histogram, the frequency 
for a given interval is represented along the y axis or ordinate, but 
for smoothed curves and for mathematical curves such as that 
defined by equation (14), it is advantageous to regard the area 
under the curve for a particular grouping interval on the x axis as 
indicating the frequency for that interval. Accordingly the total 
area under the curve corresponds to the total frequency, or N, and 
the area under any given part of the curve, i.e., the area between 
any two x values, can be expressed as a percentage of the total. 
For example, the area included between the mean and the point 
on the base line 1σ above the mean is 34.13 per cent of the total, 
and the area between plus and minus 1σ is 68.26 per cent. The 
latter percentage has already been given on p. 25 as one way of 
interpreting the standard deviation. The limits plus and minus 
2σ will include 95.45 per cent; plus and minus 3σ, 99.73 per cent; 
and plus and minus 4σ, 99.9936 per cent. Theoretically, one must 
pass to plus and minus infinity to include all the area, but in practice 
100 per cent of the cases will usually fall within the limits 
±3σ, and nearly always within the limits ±4σ. 

Fig. 7. Normal curve. 
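
The percentages quoted for the limits of plus and minus 1σ, 2σ, 3σ, and 4σ can be checked by machine. The following is only a sketch: the function name area_within is ours, and Python's error function, math.erf, is used in place of a printed table. The computed figures should agree with those in the text to within a unit in the last place (for instance, 68.27 as against the 68.26 obtained by doubling the tabled .3413).

```python
import math

def area_within(k: float) -> float:
    """Proportion of a normal curve's area lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3, 4):
    # Express each proportion as a percentage of the total area.
    print(f"+/-{k} sigma: {100 * area_within(k):.4f} per cent")
```
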

When we transform a set of scores to the so-called standard score 
form 

    z = \frac{X - M}{\sigma} = \frac{x}{\sigma}    (15)

we have each score expressed as a deviation from the mean in terms 
of multiples of the standard deviation of the original distribution. 
It can easily be shown that the standard deviation of our new set 
of scores will be unity, and the mean zero. The frequency polygon 
for the standard scores will have exactly the same shape as that 
for the original scores; this transformation is equivalent to translating 
the origin along the x axis to the point corresponding to the 
mean and changing the scale on the x axis so as to make the standard 
deviation equal to unity. If we let the total frequency be 
unity, we can think of the total area under the curve as being 
unity. This is equivalent to saying that N equals 1, and since with 
standard scores σ also equals 1, equation (14) can be written as 


    y = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}    (16)


The value of 1/√(2π) is about .39894, and therefore at z = 0 
(i.e., at the mean) y will equal .39894, which is the maximum y for 
the normal curve of unit area and unit standard deviation. The 
ordinates for other values of z will be less. For instance, at 
z = ±1, y = .24197, and at z = ±2, y = .05399. 
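
As a rough check on these ordinates, equation (16) can be evaluated directly. The sketch below is illustrative only; the name ordinate is ours. It also prints each height relative to the maximum taken as unity, which is the form in which the heights at .5σ, 1σ, 2σ, and 3σ were quoted earlier in the chapter.

```python
import math

def ordinate(z: float) -> float:
    """Height of the unit normal curve, equation (16), at standard score z."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

peak = ordinate(0.0)  # maximum height, reached at the mean
for z in (0.0, 0.5, 1.0, 2.0, 3.0):
    y = ordinate(z)
    # Second figure: height relative to the maximum taken as unity.
    print(f"z = {z:3.1f}   y = {y:.5f}   y / y_max = {y / peak:.3f}")
```
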

The percentage area under any part of the curve can be determined 
by methods of the calculus. The area under the curve between 
any two values, z_1 and z_2, is obtained as the value of the 
integral 

    \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \, dz    (17)



Perhaps this expression will be more meaningful to the student 
who has not studied integral calculus if the given area is regarded 
as composed of a large number of strips, each having a tiny base 
dz and a height of y. For each such strip the area will be nearly 
y dz, and the integral sign in formula (17) simply means the "sum 
of" the areas of these tiny strips. 
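
The "sum of strips" idea can be tried out numerically. The sketch below is ours and not a method given in the text: the function names and the choice of 10,000 strips are arbitrary, and each strip's height is taken at its midpoint.

```python
import math

def ordinate(z):
    """Height of the unit normal curve at z, as in equation (16)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def area_by_strips(z1, z2, strips=10_000):
    """Approximate the integral in formula (17) as a sum of thin strips."""
    dz = (z2 - z1) / strips
    # Each strip has base dz and height y taken at the strip's midpoint.
    return sum(ordinate(z1 + (i + 0.5) * dz) * dz for i in range(strips))

print(round(area_by_strips(0.0, 1.0), 4))  # area from the mean to z = 1
```

With these settings the printed value is .3413, the proportion of the area between the mean and 1σ.
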

The student of the calculus will also note that the first derivative 
of either equation (14) or (16) set equal to zero and solved will 
yield a maximum for the curve when x or z equals zero, thus proving 
more rigorously that the mean and mode coincide. If the 
second derivative is set equal to zero and solved for x or z, it will 
be found that the points of inflection of the curve are located where 
x is ±σ or z is ±1. 
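
For the student of the calculus, the two steps just described may be sketched for equation (16) as follows. This is our own working and not a derivation printed in the text; it uses the fact that y is never zero.

```latex
\begin{aligned}
y &= \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \\
\frac{dy}{dz} &= -z\,y = 0 \quad\Longrightarrow\quad z = 0 \\
\frac{d^2y}{dz^2} &= (z^2 - 1)\,y = 0 \quad\Longrightarrow\quad z = \pm 1
\end{aligned}
```

At z = 0 the second derivative equals −y, which is negative, so the point is a maximum and not a minimum.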

Normal curve table. Because of the widespread use of the 
normal curve, tables of proportionate frequencies and ordinates 
for various z or x/σ values are available. The student need not be 
able to integrate equation (17) in order to understand a table of 
the normal curve functions. Table A of the Appendix contains 
four columns, the first of which is z or x/σ values. The second 
column gives the area of the curve from the mean out to the corresponding 
z value, this area being the same whether z is positive or 
negative; a given z divides the curve into two parts, and the third 
column gives the area of the smaller part. The area of the larger 
part can be obtained by adding .5 to the entries in column 2. If 
one wishes to determine the proportionate area between plus and 
minus a given z, the values in column 2 should be doubled. The 
fourth column gives the y or ordinate for each of the z values. For 
purposes of reference, the meanings of the several entries in Table 
A are illustrated in Fig. 8, in which an ordinate (dotted) has been 
erected at an x/σ value of +.8. The area from the mean to +.8 
is found from column 2 as .28814; the area below this point is 
.78814, and that above is .21186, of the total area. Note that 
.78814 plus .21186 equals unity and that .78814 is .50000 plus 
.28814. The height of the curve at z = .8 is found from column 4 
as .2897, whereas the maximum height of .3989 is at the mean. 

Fig. 8. Normal curve functions. 
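
A single row of such a table can be reproduced by machine. The sketch below assumes Python's statistics.NormalDist as a stand-in for Table A; the function name table_a_row and the column order are our reading of the description above, not a reproduction of the Appendix.

```python
from statistics import NormalDist

def table_a_row(z: float):
    """Return (z, area mean-to-z, area of smaller part, ordinate) for the unit normal curve."""
    unit = NormalDist()              # mean 0, standard deviation 1
    z = abs(z)                       # the areas are the same for +z and -z
    mean_to_z = unit.cdf(z) - 0.5    # column 2
    smaller = 1.0 - unit.cdf(z)      # column 3
    height = unit.pdf(z)             # column 4
    return z, mean_to_z, smaller, height

z, mid, tail, y = table_a_row(0.8)
print(f"{z:.2f}  {mid:.5f}  {tail:.5f}  {y:.4f}")
```

For an x/σ of .8 this gives .28814, .21186, and .2897, the values read from the table in the illustration above.
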

It is frequently useful to know the relationship between the 
various measures of dispersion for a normal distribution. It can 
be shown that the following hold true: 

    Q  =  .8453 AD =  .6745 SD 
    AD = 1.1829 Q  =  .7979 SD 
    SD = 1.4826 Q  = 1.2533 AD 

It is also useful to know that for an N of 50 the SD will be about 
one-fifth the range, that for an N of 200 the SD will be about one-sixth 
the range, and that for an N of 1000 the SD will be about 
one-seventh the range. 
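
The constants relating Q, AD, and SD can be recovered from the unit normal curve. In the sketch below we assume that Q denotes the semi-interquartile range (the distance from the mean to the third quartile of a normal curve) and AD the average absolute deviation from the mean, for which the normal-curve value is √(2/π); these readings, like the variable names, are ours.

```python
import math
from statistics import NormalDist

# For a unit normal curve (SD = 1):
q = NormalDist().inv_cdf(0.75)     # Q: distance from the mean to the third quartile
ad = math.sqrt(2.0 / math.pi)      # AD: average absolute deviation from the mean

print(f"Q  = {q / ad:.4f} AD = {q:.4f} SD")
print(f"AD = {ad / q:.4f} Q  = {ad:.4f} SD")
print(f"SD = {1 / q:.4f} Q  = {1 / ad:.4f} AD")
```
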

The tabled values for the normal curve are often used in connection 
with problems similar to the following: If a distribution of 
the heights of men is normal with a mean of 68.0 inches and a 
standard deviation of 2.5, what percentage of men are more than 
6 feet tall? We find z as the difference between 72 and 68, divided 
by σ, or z = 1.6; then from Table A we find the percentage of cases 
which fall above this z value to be 5.48. Suppose that the mean 
IQ of 10-year-old boys is 100 and the standard deviation 16. What 
percentage have IQ's between 90 and 110? What percentage of 
10-year-old boys would be classified as "gifted" (IQ above 140)? 
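
Problems of this kind reduce to converting a raw score to z and reading off an area. A minimal sketch follows, with statistics.NormalDist standing in for Table A; the function names percent_above and percent_between are ours.

```python
from statistics import NormalDist

def percent_above(cutoff, mean, sd):
    """Percentage of a normal distribution lying above `cutoff`."""
    z = (cutoff - mean) / sd                    # formula (15)
    return 100.0 * (1.0 - NormalDist().cdf(z))

def percent_between(low, high, mean, sd):
    """Percentage of a normal distribution lying between `low` and `high`."""
    dist = NormalDist(mean, sd)
    return 100.0 * (dist.cdf(high) - dist.cdf(low))

print(round(percent_above(72, 68.0, 2.5), 2))   # heights: men over 6 feet
```

The height problem gives 5.48, as in the text. The two IQ questions can be worked in the same way with percent_between(90, 110, 100, 16) and percent_above(140, 100, 16).
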

The student will have noted that the answers to problems similar 
to the foregoing are possible by virtue of the fact that the areas 
and ordinates of Table A are for the standard score form of the 
normal curve with total area set equal to unity. By formula (15) 
one can pass from raw scores to standard scores and vice versa, 
and knowing N one can readily convert proportionate areas to 
frequencies or frequencies to proportions. Thus the table can be 
used with any normal distribution regardless of the original measurement 
units. 

Standard scores. Perhaps it should be pointed out at this place 
that transforming scores, when distributions are normal or approximately 
so, to standard scores leads to new sets of scores which are 
comparable. For example, inches and pounds are not comparable 
units. If a man is 71 inches in height and weighs 170 pounds, it is 
impossible to say whether he is taller than he is heavy, but when 
the 71 inches is transformed to a z of .9 and the 170 pounds to a z of 
1.3, we are able to say that, relative to his position in the two distributions, 
he is heavier than he is tall. Likewise, the raw scores 
on two psychological tests will seldom be comparable; changing to 
standard scores permits comparison, so that one can decide whether 
a boy's performance on one test is better or worse than his performance 
on another. This assumes, of course, a close approximation 
to normality, and that the means and standard deviations 
used in the transformations are based on the same or highly similar 
groups. 

Standard scores, as defined by formula (15), will involve both 
positive and negative values and decimal scores. Since these are 
awkward to use, a further transformation is frequently made in 
such a way as to yield a distribution with a preassigned M and σ, 
instead of the M of 0 and σ of 1 which hold for the standard scores 
defined by formula (15). If we wish a distribution with a mean of 
50 and a σ of 10, we can simply multiply each z by 10 and add 50. 
Multiplying each z by 20 and adding 100 would yield a mean of 
100 and a σ of 20. Either of these transformations will get rid of 
negative values and permit a sufficient number of score values 
without the use of decimals. In general, if we wish to transform 
a set of scores having a mean, M, and a standard deviation, σ, to 
new values to be called Z's, with mean equal to any value K and 
σ equal to S, all we need to do is to apply the relationship 

    Z = z(S) + K,  or  Z = \left(\frac{X - M}{\sigma}\right)(S) + K

which becomes 

    Z = \frac{S}{\sigma}(X) - \frac{M}{\sigma}(S) + K

The last form is the easier to use in practice, particularly with 
a calculating machine. Note that the last two terms will combine 
numerically and therefore can be placed in the lower dial as a positive 
or negative number; then the numerical value of S/σ can be 
set in the keyboard as a constant to be multiplied in turn upon the 
varying values of X. If the machine has a continuous upper dial, 
the best procedure is to multiply by the highest X first, and then, 
without clearing the dials, to subtract once for each successively 
lower value of X. Care is needed in aligning decimals, a check on 
which can be obtained by multiplying by the X nearest M. This 
should lead to a value, in the lower dial, which is near K. With 
this setup, one can readily run off a table giving the values of Z 
for varying values of X. 
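
The same conversion table can be run off in a few lines of Python instead of on a calculating machine. The sketch below is an illustration under stated assumptions: the function name conversion_table is ours, the raw scores are made up, and the standard deviation is computed with N in the denominator.

```python
from statistics import mean, pstdev

def conversion_table(scores, K=50.0, S=10.0):
    """Map each distinct raw score X to Z = (S/sigma) X + (K - (M/sigma) S)."""
    M = mean(scores)
    sigma = pstdev(scores)          # standard deviation with N in the denominator (an assumption)
    slope = S / sigma               # the constant multiplier
    constant = K - (M / sigma) * S  # the last two terms combined into one number
    return {X: slope * X + constant for X in sorted(set(scores), reverse=True)}

raw = [12, 15, 15, 18, 20, 20, 20, 23, 25, 32]   # made-up scores for illustration
for X, Z in conversion_table(raw).items():
    print(f"X = {X:2d}   Z = {Z:5.1f}")
```

The raw score nearest M should come out near K, which is the same check on decimals suggested above.
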

The comparability of two sets of standard scores, either as z's 
or as Z's with the same mean (K) and same σ (S), does not hold 
for skewed distributions unless the two distributions show the same 
degree and direction of skewness. This is unlikely to be the case 
in practice. There is a scheme for use with skewed distributions 
which not only leads to comparable units but which also normalizes 
the distributions, i.e., changes the distributions from skewed to 
normal. This procedure is known as T scaling, and the resulting 
scores are known as T scores. They are usually so calculated as to 
yield a mean of 50 and a σ of 10, but other values for these constants 
are possible. The detailed procedure may be found in 
McCall's Measurement,* which also includes a table for expediting 
the transformation. Suffice it to say here that T scaling basically 
involves determining the proportion (or percentage) of cases exceeding 
a given value plus half those reaching that value, and then 
entering such proportions in a table of the normal curve function 
to find the corresponding z values. Standard scores based on a 
normal distribution of original scores and T scores based on any 
shape distribution are comparable, providing they have been so 
determined as to yield the same mean and standard deviation. 
They differ only in the way in which they are computed, the standard 
score being a linear transformation which leaves the shape of 
the distribution unchanged, whereas T scaling changes the distribution 
to the normal form. If we begin with an exactly normal 
distribution and convert the scores to both z's and T's, there will 
be a linear correspondence between the two sets of transformed 
scores. If their means and sigmas are set equal, the Z's and T's 
will be equal to each other. 
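
The outline of T scaling given above can be sketched as follows. This is only one reading of the brief description in the text, not McCall's detailed procedure or his table: the function name t_scores is ours, and an inverse normal function takes the place of the table lookup.

```python
from collections import Counter
from statistics import NormalDist

def t_scores(scores, mean=50.0, sd=10.0):
    """Normalized (T-type) score for each distinct raw score."""
    n = len(scores)
    counts = Counter(scores)
    unit = NormalDist()
    result = {}
    above = 0                                   # cases exceeding the current value
    for value in sorted(counts, reverse=True):
        # Proportion exceeding the value plus half those reaching it.
        p_exceed = (above + 0.5 * counts[value]) / n
        z = unit.inv_cdf(1.0 - p_exceed)        # z with that proportion of area above it
        result[value] = mean + sd * z
        above += counts[value]
    return result

print(t_scores([3, 5, 5, 6, 6, 6, 7, 9, 14, 22]))   # a made-up, skewed set of scores
```
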

So far, the normal curve has been discussed as a frequency curve, 
and the area interpretation has been in terms of the number of 
individuals or percentage of cases falling between certain limits. 
This same curve is often spoken of as the normal probability curve, 
and because of its frequent use in connection with the problem of 
sampling it is necessary that we discuss briefly the concept of 
probability and relate it to the normal curve. 

* McCall, W. A., Measurement, New York: Macmillan, 1939, pp. 506-508. 





Probability. If one had a box containing 70 white and 30 black 
balls, well mixed, and were to draw 1 ball at random, the chance of 
the drawn ball’s being black is said to be 30 out of 100, and the 
chance of its being white would be .70. This can be interpreted 
to mean that, if we made 1000 random draws, each time replacing 
the drawn ball and remixing the contents of the box, the percentage 
of black balls drawn would be about 30, and of white draws about 
70. If one rolls a die, the probability of obtaining a 4 is 1/6; i.e., a 
large number of rolls would yield a 4 about 1/6 of the time. If one 
tosses a symmetrical coin, it is usually said that there is a fifty-fifty 
chance of its landing "heads up," or the probability of a head 
is 1/2. This is another way of saying that in the long run the proportion 
of times that the coin lands as a head will be the same as 
the proportion of times it lands as a tail. 

These very simple examples illustrate a definition of probability: 
if an event can happen in m ways and fail in n ways, all possible 
ways being equally likely, the probability of its occurring is 
m/(m + n) and of its failing is n/(m + n). That is, a probability 
figure is the ratio of the number of favorable events to the total 
number of events, and it is therefore necessary that we be able to 
enumerate events in order to arrive at a probability figure. 

If we draw a card from a pack, the probability of obtaining a 
spade is 1/4, and the probability of drawing a club is also 1/4, but the 
probability of drawing either a spade or a club is 1/4 plus 1/4, or 1/2. 
If we roll a die, the probability of obtaining either a 4 or a 5 is 1/6 
plus 1/6, or 1/3. These two situations illustrate the addition theorem 
of probability: the probability that either one event or another 
event will happen is the sum of the probabilities of their occurrences 
as single events. (The events must be mutually exclusive; 
i.e., if one occurs, the other cannot.) 

If we roll a pair of dice, the probability of a 2 on the first and 
a 5 on the second is 1/6 times 1/6, or 1/36. If we toss 2 coins, the 
probability that the first will land a head and the second a head is 
1/2 times 1/2, or 1/4, which is, of course, the probability that both will 
land as heads. Notice that the result obtained with the second 
die or coin is independent of the outcome of the first die or coin. 
These two examples illustrate the multiplication theorem: the 
probability of 2 (or more) independent events' occurring simultaneously 
or in succession (one and the other) is the product of 
their separate probabilities. 
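
Both theorems can be checked by counting equally likely outcomes, in keeping with the definition of probability as a ratio of favorable to total events. The sketch below is ours; the helper name probability is hypothetical, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

def probability(outcomes, favorable):
    """Ratio of favorable to total outcomes, all outcomes equally likely."""
    outcomes = list(outcomes)
    hits = sum(1 for o in outcomes if favorable(o))
    return Fraction(hits, len(outcomes))

faces = range(1, 7)

# Addition theorem: a 4 or a 5 on one die (mutually exclusive events).
p_4_or_5 = probability(faces, lambda f: f in (4, 5))

# Multiplication theorem: a 2 on the first die and a 5 on the second.
p_2_then_5 = probability(product(faces, faces), lambda pair: pair == (2, 5))

print(p_4_or_5, p_2_then_5)
```

The counts give 1/3 and 1/36, agreeing with 1/6 plus 1/6 and 1/6 times 1/6.
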





As just indicated, if one tosses 2 coins, the probability that the 
first will land a head and also the second a head will be 1/2 times 
1/2, or 1/4, which is the probability that both will fall as heads. The 
probability that the first will land a head and the second a tail will 
also be 1/2 times 1/2, or 1/4. But 1 head and 1 tail can be obtained in a 
manner mutually exclusive to the above; i.e., the first can land as a 
tail and the second as a head, and this combination or event has a 
probability of 1/4, whence the probability of obtaining 1 head and 
1 tail will be 1/4 plus 1/4, or 1/2. This same result can be arrived at 
by listing all the possible combinations and taking the ratio of the 
number of favorable to the total number of possible combinations. 
The possible combinations are HH, HT, TH, TT, from which we 
see that 2 out of the 4 possible events are favorable for the occurrence 
of 1 head and 1 tail. We also note that 1 out of 4 is favorable 
to 2 heads. 

Suppose we were to toss 3 coins; we would have the following 
possible combinations: 

    Coin 1   H H H H T T T T 
    Coin 2   H H T T H H T T 
    Coin 3   H T H T H T H T 

The total number of possible "events" is 8, 1 of which is favorable 
to 3 heads, 3 to 2 heads, 3 to 1 head, and 1 to no heads, thus giving 
the respective probabilities of 1/8, 3/8, 3/8, and 1/8. If we were to toss 
4 coins, we would have the following probabilities: 


    4 heads   1/16 
    3 heads   4/16 
    2 heads   6/16 
    1 head    4/16 
    0 head    1/16 




The student should satisfy himself that these are the correct figures 
by writing down all the combinations possible and counting 
those favorable to any particular number of heads. 
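
The listing and counting can also be left to a machine. A minimal sketch follows; the function name head_count_probabilities is ours. Note that Python reduces fractions to lowest terms, so that 4/16 is shown as 1/4 and 6/16 as 3/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def head_count_probabilities(n_coins):
    """Probability of each number of heads, found by listing every combination."""
    combos = list(product("HT", repeat=n_coins))        # all 2**n equally likely combinations
    tally = Counter(combo.count("H") for combo in combos)
    return {heads: Fraction(tally[heads], len(combos))
            for heads in range(n_coins, -1, -1)}

for heads, p in head_count_probabilities(4).items():
    print(f"{heads} heads: {p}")
```
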

Binomial distribution. The process of determining possible 
combinations becomes quite laborious for, say, 10 coins, but the 
several probabilities can be obtained by the coefficients in the 
expansion of the binomial (a + b)^n. Thus for n = 2 (i.e., 2 coins) 
we have a^2 + 2ab + b^2, or 1, 2, 1; for n = 3, a^3 + 3a^2 b + 3ab^2 
+ b^3, or 1, 3, 3, 1; for n = 4 the coefficients are 1, 4, 6, 4, 1. In 
each case the sum of the coefficients, 2^n, will be the total possible 
combinations, and the coefficients taken as ratios with the common 
denominator, 2^n, will represent the probabilities for n, n − 1, 
n − 2, ..., 0 heads. 

The student may recall that the general expansion of (a + b)^n is 

    a^n + n a^{n-1} b + \frac{n(n-1)}{1 \times 2} a^{n-2} b^2
        + \frac{n(n-1)(n-2)}{1 \times 2 \times 3} a^{n-3} b^3 + \cdots

This expansion will contain (n + 1) terms and will terminate in 
b^n. For n = 10, we have the following coefficients: 1, 10, 45, 120, 
210, 252, 210, 120, 45, 10, 1, which sum to 1024, or 2 to the tenth 
power. Thus the probability that all 10 coins will fall as heads is 
1/1024; 9 heads, 10/1024; etc. If we plot these values as a frequency 
polygon (these coefficients are frequencies in the sense 
that they represent the expected number of times for 10 heads, 
9 heads, etc., out of a total of 1024 tosses), we will have a bell-shaped 
graph which will resemble somewhat the normal curve. 
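
The coefficients for n = 10 need not be expanded by hand. The sketch below uses Python's math.comb for the binomial coefficients and prints a crude text version of the plot described above; the scale of one mark per 6 expected occurrences is an arbitrary choice of ours.

```python
from math import comb

n = 10
coefficients = [comb(n, heads) for heads in range(n, -1, -1)]   # 10 heads down to 0 heads
total = sum(coefficients)                                       # equals 2**n

print(coefficients, total)
# Crude text "polygon": one mark per 6 expected occurrences in 1024 tosses.
for heads, c in zip(range(n, -1, -1), coefficients):
    print(f"{heads:2d} heads  {c:4d}/{total}  " + "*" * round(c / 6))
```
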

Another and more useful way, for our purpose, of considering the 
binomial expansion is to use p and q, in the place of a and b, with 
p defined as the probability of success on a single element and q 
as the probability of failure, or q = 1 − p. Thus we would have 
(p + q)^n. Suppose n = 2; the expression would be p^2 + 2pq + q^2. 
If p = 1/2, as in the coin situation, this would give (1/2)^2 + 2(1/2)(1/2) 
+ (1/2)^2, or 1/4, 2/4, and 1/4 as the probabilities for securing 2 heads, 
1 head, and 0 head respectively. Each term is itself a probability 
fraction; the numerators are 1, 2, and 1 as before. For n = 10, 
we would have (1/2)^10 or 1/1024, 10(1/2)^9(1/2) or 10/1024, 45(1/2)^8(1/2)^2 
or 45/1024, etc., as the probabilities for obtaining 10 heads, 9 
heads, 8 heads, etc. 

The chief advantage of using the p and q notation is that we can 
readily see what happens when p is not equal to 1/2. Consider the 
expectation when we roll a pair of dice with "success" defined as 
the rolling of "snake eyes." We would have (p + q)^2 = (1/6 + 5/6)^2 
= 1/36 + 2(5/36) + 25/36 as indicating the probability of obtaining 
2 one-spots, 1 one-spot, and 0 one-spot. If 3 dice were rolled, we 
would have 1/216 + 3(5/216) + 3(25/216) + 125/216, or 1/216, 15/216, 
75/216, and 125/216 as the respective probabilities for 3, 2, 1, and 0 one-spots. 
The important thing for the student to note is that these probabilities 
are definitely skewed; not all probability distributions are of 
the symmetrical type. The student can, as a tedious exercise, work 
out the probabilities for 4, 5, 6, 7, and 8 dice, and therefrom learn 
that the shape of the distribution changes from marked skewness 
to less and less skewness as the number of dice is increased. It 
can be easily shown that, if p = 5/6 and q = 1/6, the skewness will 
be in the opposite direction. Another proposition which the 
student can demonstrate to himself is that, for a fixed n, the 
skewness increases as p is taken farther from 1/2 in either direction; 
extremely small or extremely large p's (near unity) lead to very 
marked skewness. 
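
The tedious exercise can be shortened with a few lines of Python. The sketch below is illustrative; the function name binomial_terms is ours, and exact fractions are used, which Python prints in lowest terms (15/216 appears as 5/72).

```python
from fractions import Fraction
from math import comb

def binomial_terms(n, p):
    """Terms of (p + q)**n: probabilities of n, n - 1, ..., 0 successes."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n, -1, -1)]

p_one_spot = Fraction(1, 6)
for n_dice in (2, 3, 8):
    terms = binomial_terms(n_dice, p_one_spot)
    print(n_dice, [str(t) for t in terms])
```

Calling binomial_terms with p = Fraction(5, 6) gives the same terms in reverse order, which is the reversal of skewness mentioned above.
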

The binomial expansion provides the probabilities of the theoretically 
expected frequencies for given n's, p's, and q's. Such 
theoretical distributions can be described as to central value, variation, 
skewness, and kurtosis. The numerical values for these 
measures may be obtained by direct computation from the distributions 
built up by the binomial expansion, or these measures may 
be obtained by simple formulas, which can be derived by simple 
algebra, without having the actual distributions available. 

The student can, as an exercise, perform an empirical check on 
the formulas for the mean and standard deviation. The formulas 
are: 


    M = np

    \sigma = \sqrt{npq}

    g_1 = \frac{q - p}{\sqrt{npq}}    (skewness)

    g_2 = \frac{1 - 6pq}{npq}    (kurtosis)


It should be noted that n is the number of elements, not the 
number of cases. The formula for skewness permits several deductions. 
When p = 1/2, q also equals 1/2, and hence the skewness is 
zero; the degree of skewness for a fixed n depends upon the deviation 
of p from 1/2; i.e., the smaller or the larger the probability of 
success for each element, the more skewed the distribution. Note 
also that, since n is in the denominator, the larger the number (n) 
of elements, the smaller the skewness for fixed values of p and q. 
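
The empirical check on the formulas can be sketched as follows. We assume here that skewness is measured as the third moment about the mean divided by the 3/2 power of the second, and kurtosis as the fourth moment divided by the square of the second, less 3; the definitions on pp. 27-28 are not reproduced in this chapter, so these forms, like the function names, are our assumption.

```python
from math import comb, sqrt

def binomial_moments(n, p):
    """Mean, SD, skewness and kurtosis computed directly from the expanded distribution."""
    q = 1.0 - p
    probs = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pr for k, pr in enumerate(probs))
    # Second, third and fourth moments about the mean.
    m2, m3, m4 = (sum((k - mean) ** r * pr for k, pr in enumerate(probs)) for r in (2, 3, 4))
    return mean, sqrt(m2), m3 / m2**1.5, m4 / m2**2 - 3.0

def binomial_formulas(n, p):
    """The same four measures from the short formulas."""
    q = 1.0 - p
    return n * p, sqrt(n * p * q), (q - p) / sqrt(n * p * q), (1.0 - 6.0 * p * q) / (n * p * q)

for n, p in [(10, 0.5), (3, 1 / 6), (8, 1 / 6)]:
    print(n, round(p, 3), binomial_moments(n, p), binomial_formulas(n, p))
```
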

The above formulas describe the theoretically expected distribution 
for given n's, p's, and q's. As will be seen later, any empirical 
distribution obtained by tossing 10 coins or rolling 3 dice will 
yield values which, for reasons to be discussed, will only approximate 
these values. 

It is of interest to consider plotting the binomial distribution as a 
histogram—the height of the successive bars will indicate the 
several expected frequencies, each of which is the numerator for a 
probability fraction. Now, if we work out the expected frequencies 
for number of heads when 20 coins are tossed, and if in drawing 
the histogram we so scale the ordinate as to have the over-all 
height about the same as that for the 10-coin situation and also 
squeeze the base-line scale (ranging from 0 to 20) into about the 
same over-all distance as for 10 coins, the vertical bars will be 
narrower, and the resulting picture will look more like a normal 
histogram than that obtained for 10 coins. If we repeat the process 
with n larger and larger, each time scaling our axes to about the 
same size as used for 10 coins and for 20 coins, the several bars of 
the histograms will become narrower and narrower, and with n 
sufficiently large the bars will seem to merge and the contour of 
the graph will tend to appear indistinguishable from a normal 
curve. 

The normal curve is for a continuous variable on the x axis, 
whereas the binomial distribution involves a discrete variable, or 
point series. For example, it is impossible to have any values 
between, say, 22 and 23 heads. As n is taken larger and larger, 
and the total base line is kept fixed, the obtained values or possible 
points become more and more closely spaced so that the point 
series approaches, or at least takes on the appearance of, continuity. 
As n approaches infinity, the binomial distribution 
approaches the normal distribution as a limit. 

This would suggest the possibility of using the normal curve as a 
method for approximating the probabilities obtainable by the 
binomial distribution. In order to see how this might be done, 
note that, since the bars of the histogram are closely spaced for 
large n, we can think of the area between two X values as indicating 
the sum of the frequencies expected for the discrete points 
between the given X's. Suppose we consider the case of n = 16. 
By the binomial expansion, it can be ascertained that the probabilities 
of obtaining 10, 11, and 12 heads are respectively 8008/2^16, 
4368/2^16, and 1820/2^16. The numerical value of 2 to the 16th 
power is 65,536, and the sum of the three probabilities becomes 
14,196/65,536, which of course represents the probability of obtaining 
10 or 11 or 12 heads. If we convert this fraction to its decimal 
equivalent, we have .216614; i.e., we would expect to get 10 or 11 
or 12 heads about 216,614 times in a million tosses of 16 coins. 

The distribution of all the expected frequencies for 16 coins will 
have a mean of np = 16(1/2) = 8, and a σ of √(npq) = √(16(1/2)(1/2)) 
= 2. Now let us treat this point distribution as though it involved 
a continuous variable normally distributed with M = 8 and σ = 2. 
If we should take X's of 9.5 and 12.5, we would obviously be dealing 
with an interval on the base line which would include the discrete 
points 10, 11, and 12. The area under the curve between 
these two X values can readily be determined by expressing each 
as a deviate from the mean, and then dividing the deviate by 2. 
This gives us two x/σ values: (9.5 − 8)/2 = .75 and (12.5 − 8)/2 
= 2.25. Turning to Table A of the Appendix, we find that the 
area from the mean to an x/σ of 2.25 is .48778 of the total, and the 
area from the mean to an x/σ of .75 is .27337. The difference, 
.21441, represents the area between an X of 9.5 and an X of 12.5. 
Note that this figure approximates the probability, .216614, given 
above for securing 10 or 11 or 12 heads. The error in the approximation 
is slightly larger than .002; for n taken larger, the approximation 
becomes better. 
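
The comparison for 16 coins can be repeated by machine, and tried for other values of n. The sketch below is ours: the function names are hypothetical, statistics.NormalDist replaces Table A, and the second case (64 coins, 36 to 40 heads) is an arbitrary additional example not taken from the text.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_probability(n, p, low, high):
    """Binomial probability of obtaining between `low` and `high` successes inclusive."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(low, high + 1))

def normal_approximation(n, p, low, high):
    """Area of the fitted normal curve over the interval low - .5 to high + .5."""
    curve = NormalDist(n * p, sqrt(n * p * (1.0 - p)))   # M = np, sigma = sqrt(npq)
    return curve.cdf(high + 0.5) - curve.cdf(low - 0.5)

for n, low, high in [(16, 10, 12), (64, 36, 40)]:
    exact = exact_probability(n, 0.5, low, high)
    approx = normal_approximation(n, 0.5, low, high)
    print(n, round(exact, 6), round(approx, 6), round(abs(exact - approx), 6))
```

For 16 coins the exact figure is .216614 and the approximation about .2144; the last digit of the approximation may differ from the .21441 of the text, which rests on rounded table entries.
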

Notice that, in approximating the probability, we have utilized 
an area under a curve; i.e., we have said that the area between 
two X values taken relative to a total area may be interpreted as a 
probability figure. This is not inconsistent with our original 
definition of probability involving number (frequency) of events 
favorable relative to a total number of events (total frequency). 
Since, as previously indicated, the total area under a frequency 
curve for a continuous variable (or function) can be regarded as 
the total frequency, and the area for a particular segment can be 
regarded as the frequency with which values (or scores) fall in the 
given segment, it follows that the ratio of the segmental to the 
total frequency may be spoken of as a probability—the probability 
that a score falls between the two X values defining the segment. 
When we are dealing with a distribution of the normal type, the 
probability associated with a given segment is found by converting 
the two X values, which define an interval, into x/σ values 
and then determining the area from Table A. The obtained proportionate 
area represents the probability expressed as a decimal 
fraction. 





It should be obvious that, when we consider the unit normal 
curve, we can readily specify the proportionate area between any 
two x/σ values, say z_1 and z_2, and interpret the proportion as the 
probability of obtaining x/σ values between the given z_1 and z_2. 
By reference to tables more extensive than Table A, it can be 
found that the area between an x/σ of −1.96 and an x/σ of +1.96 
is very nearly .95; hence it would be said that .95 represents the 
probability of obtaining x/σ values between these two points. 
Furthermore, it can be said that .05 represents the probability 
that an x/σ, drawn at random from a normally distributed supply 
of (x/σ)'s, will be numerically larger than 1.96. Similarly it can 
be said that the probability of drawing an x/σ between ±2.576 is 
very nearly .99, while the probability for an x/σ falling outside 
these limits is .01. 

The foregoing interpretation of proportionate areas under the 
normal curve as probabilities is, in a sense, the basis for sometimes 
calling this curve the normal probability curve. It has been noted 
that for p not equal to q, the point binomial leads to skewed probability 
distributions. For continuous functions it is also possible 
to have distributions, other than the normal, which permit probability 
statements on the basis of proportionate areas. Later we 
shall consider the use of three nonnormal probability distributions. 



CHAPTER 5 


Sampling Errors and Statistical Inference 


As was stated in the introductory chapter, the function of 
statistical method is twofold: (1) the reduction of data to a few 
convenient descriptive terms and (2) deduction from these values. 
(The student should reread pp. 1-2 at this time.) Usually any 
mass of data is merely a small, and frequently a very small, sample 
of what, theoretically at least, could be collected. The sample is 
drawn from a given population, which needs to be defined very 
definitely, and any deductions made from the sample can apply 
only to the defined population. The nature of any random sample 
is always partly determined by the operation of chance factors. 
Thus, if the distribution of the heights of all the men in the United 
States were normal with a mean of 68 inches and a standard deviation 
of 2.7, and if we were to draw a random sample of 100 men, 
we would expect that this sample would yield a distribution exactly 
like that of the total population of men except for chance fluctuations. 
(Tossing 2 coins 100 times would not necessarily yield 2 
heads 25 times, 1 head 50 times, and 0 head 25 times; any discrepancy 
would be said to be the result of chance.) The operation 
of chance might give us more than 50 men over 68 inches tall; this 
chance discrepancy might result in the sample mean and standard 
deviation being different from 68 and 2.7, and these differences 
would, therefore, be chance differences. 


EMPIRICAL DEMONSTRATION 

The operation of chance can be illustrated by tossing 7 coins
50 times and tabulating the number of heads per toss. The
obtained frequencies will usually vary considerably from those
expected, which would be proportional to 1, 7, 21, 35, 35, 21, 7, 1.
When the mean number of heads for 50 tosses is computed, it is not
likely to be exactly 3.5 (the mean of the expected frequencies),
and the discrepancy from 3.5 can be attributed to chance.
Likewise, 100 tosses will show departures from the expected frequencies,
and consequently the mean based upon 100 tosses will differ
somewhat from 3.5. Furthermore, the standard deviation of the
distribution of heads will likely differ from the standard deviation
(1.323) of the expected frequencies. The student, as an exercise,
can demonstrate the foregoing statements by actually tossing
coins. Indeed it will be quite instructive if each class member
tosses 7 coins 50 times, each time tallying the number of heads
that turn up. This will lead to a frequency distribution running
from 0 to 7 heads, with an N of 50. Then a second series of 50
tosses should be made, thus providing a second distribution. The
2 frequency distributions can be combined, so that each student
will have 3 distributions, 2 with N's of 50 and 1 with an N of 100.
Next, for each of the 3 distributions the proportion of times 7
heads, 6 heads, etc., are secured should be compared with the
expected proportions, 1/128, 7/128, etc. Note that chance is so
operating as to produce a distribution somewhat similar to that
expected, but at the same time is operating in such a manner as
to lead to discrepancies between observed and expected frequencies.
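For a student without coins at hand, the exercise can be imitated on a computer. The following is only a minimal sketch: the function names, the seed, and the use of Python's pseudo-random generator in place of real coins are assumptions of the sketch, not part of the exercise as described. It computes the expected proportions from the binomial expansion and then "tosses" two series of 50.

```python
import math
import random

N_COINS = 7

def expected_proportions(n=N_COINS):
    """Binomial proportions for 0..n heads with unbiased coins: C(n, k) / 2**n."""
    return [math.comb(n, k) / 2 ** n for k in range(n + 1)]

def toss_series(n_tosses, rng, n=N_COINS):
    """Number of heads on each of n_tosses throws of n unbiased coins."""
    return [sum(rng.random() < 0.5 for _ in range(n)) for _ in range(n_tosses)]

def mean_and_sd(values):
    """Mean and standard deviation (divisor N) of a list of values."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return m, sd

expected = expected_proportions()
exp_mean = sum(k * p for k, p in enumerate(expected))
exp_sd = math.sqrt(sum((k - exp_mean) ** 2 * p for k, p in enumerate(expected)))

rng = random.Random(1949)          # arbitrary seed (an assumption of this sketch)
first = toss_series(50, rng)       # first series of 50 tosses
second = toss_series(50, rng)      # second series of 50 tosses
combined = first + second          # the two series pooled, N = 100

for label, series in (("first 50", first), ("second 50", second), ("all 100", combined)):
    m, sd = mean_and_sd(series)
    print(f"{label}: mean = {m:.3f}, SD = {sd:.3f}")
print(f"expected: mean = {exp_mean:.3f}, SD = {exp_sd:.3f}")
```

Changing the seed gives a different "student"; each run should depart somewhat from the expected mean and standard deviation, which is the point of the demonstration.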

Each student should compute the means and standard
deviations for each of the 3 distributions. Note how far these values
depart from the expected mean of 3.5 and the expected standard
deviation of 1.323. Then the several means and standard
deviations secured by the class members should be brought together.
In order better to understand what happens when each of several
persons tosses 7 coins 50 times, i.e., takes a sample of 50 tosses,
a frequency distribution of the means, also of the standard
deviations, based on 50 tosses should be made. Likewise a separate
distribution should be made for the M's based on 100 tosses; also,
the standard deviations. Study these distributions carefully.
Their central tendencies are near what values? What is the
extent of dispersion for these distributions of M's and $\sigma$'s?
Compute the means and the standard deviations for these distributions
of M's and $\sigma$'s. Is there any difference in the dispersion for the
distribution of means based on 50 tosses and that based on 100
tosses? How would you account for this difference? In general,
what is the shape of these distributions of M's and $\sigma$'s?




In Table 7 will be found the distributions obtained by several
of the author's classes. Though these are not models for number
of intervals, they are nevertheless sufficient as a basis for answering
the foregoing questions. Note that both distributions appear to
be normal, that both center very near the mean of the theoretical
distribution (3.5), and that the variability for means based on
100 tosses is less than that based on 50 tosses. It would thus seem
that means based on 100 tosses are somewhat more stable or less
variable than those based on 50 tosses. Does this suggest that a
larger number of tosses, i.e., a larger sample, would tend to iron
out the chance factors that operate to produce discrepancies
between an observed distribution of number of heads and the
expected distribution obtained by the binomial expansion? Do
you think that means based on 500 tosses would show less
dispersion than means based on 100 tosses?

Table 7. Distributions of 600 Means Based on 50 Tosses, and 300
Means Based on 100 Tosses, of 7 Coins

  Mean                     50 Tosses    100 Tosses

  4.00-4.09                     3
  3.90-3.99                    14
  3.80-3.89                    35            4
  3.70-3.79                    50           23
  3.60-3.69                    98           58
  3.50-3.59                   119           78
  3.40-3.49                   120           85
  3.30-3.39                    85           32
  3.20-3.29                    52           17
  3.10-3.19                    21            3
  3.00-3.09                     2
  2.90-2.99                     1

  Number of means             600          300
  Mean of means             3.516        3.513
  SD* of distr. of means     .190         .135
  Expected SD                .187         .132

* Corrected for grouping.
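The summary rows of Table 7 can be imitated by simulation. What follows is a hedged sketch rather than a reconstruction of the classes' data: the seed and helper name are arbitrary assumptions, pseudo-random bits stand in for coins, and no grouping into intervals (hence no correction for grouping) is attempted. It draws 600 means based on 50 tosses and 300 means based on 100 tosses, then sets the standard deviation of each set of means beside 1.323 divided by the square root of the number of tosses.

```python
import math
import random
import statistics

def mean_heads(n_tosses, rng, n_coins=7):
    """Mean number of heads over n_tosses throws of n_coins unbiased coins."""
    # Each of the n_coins random bits stands for one coin; a 1 bit is a head.
    total = sum(bin(rng.getrandbits(n_coins)).count("1") for _ in range(n_tosses))
    return total / n_tosses

rng = random.Random(7)                    # arbitrary seed (an assumption)
universe_sd = math.sqrt(7 * 0.5 * 0.5)    # sqrt(npq) = 1.323

results = {}
for n_tosses, n_samples in ((50, 600), (100, 300)):
    means = [mean_heads(n_tosses, rng) for _ in range(n_samples)]
    results[n_tosses] = (
        statistics.fmean(means),              # mean of the means
        statistics.pstdev(means),             # SD of the distribution of means
        universe_sd / math.sqrt(n_tosses),    # expected SD
    )

for n_tosses, (m, sd, expected_sd) in results.items():
    print(f"{n_tosses} tosses: mean of means = {m:.3f}, "
          f"SD of means = {sd:.3f}, expected SD = {expected_sd:.3f}")
```

The simulated values will not reproduce Table 7 figure for figure, since they are themselves subject to chance, but they should fall close to the expected standard deviations of .187 and .132.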

Each series of 50 tosses or 100 tosses represents a tryout or
sample of what happens when 7 coins are tossed 50 or 100 times.
For unbiased coins, we infer from the binomial expansion that, if
we made a large number of tosses, the distribution of heads would
be proportional to 1, 7, 21, 35, 35, 21, 7, 1; and the mean and $\sigma$
of this distribution would approximate 3.5, ($np$), and 1.323,
($\sqrt{npq}$). Referring again to Table 7, we note that the standard
deviation (.190) of the distribution of means based on 50 tosses
is very nearly equal to $1.323/\sqrt{50}$, which equals .187, and that
the standard deviation of the means based on 100 tosses is nearly
$1.323/\sqrt{100}$. These agreements suggest that the variability of
means may be a function of, first, the standard deviation of the
"universe" being sampled and, second, the size of the sample.

Table 8. Distribution of 93 Percentages Reported by 93 Students
for Times 6 Heads and 3 Heads Were Obtained When 7 Coins Were
Tossed 100 Times

                         For 6 Heads        For 3 Heads
                           %      f           %      f

                          13      1          35      2
                          12      1          34      3
                          11      1          33      2
                          10      4          32      6
                           9      3          31      4
                           8      8          30      3
                           7     14          29      6
                           6     13          28      8
                           5     18          27      8
                           4     17          26     10
                           3     10          25     12
                           2      2          24      7
                           1      1          23      4
                                             22      9
                                             21      3
                                             20      3
                                             19
                                             18
                                             17      2
                                             16
                                             15      1

  Number of %'s                  93                 93
  Mean of %'s                  5.76              26.32
  SD of distr. of %'s          2.26               4.13
  Expected mean                5.47              27.34
  Expected SD                  2.27               4.46


As a further demonstration of what can be expected to happen
when sample tosses of 7 coins are made, each student should report
the number of times out of 100 tosses that he secured a specific
number of heads. This number will obviously be a percentage
and can be compared with the expected percentage. If there are
30 members in the class, 30 percentages will be available for the
percentage of times a given number of heads was secured. If a
distribution of these percentages is made separately for the times
6 heads and 3 heads turn up, results similar to those in Table 8
can be expected. It will be noted that both distributions in this
table tend to center about the expected percentages (5.47 and
27.34); that the variabilities differ, although the percentages for
each distribution are based on 100 tosses; and that the
distribution for 3 heads is reasonably symmetrical, whereas that
for 6 heads shows skewness. The standard deviation of the
percentages for 6 heads is 2.26, which is very near the value of
$\sqrt{5.47(100 - 5.47)/100}$, or 2.27, and the standard deviation for 3
heads is 4.13, which is fairly close to $\sqrt{27.34(100 - 27.34)/100}$,
or 4.46. The meaning of these square root expressions will become
clearer later.
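The expected percentages and the two square root expressions just quoted can be checked with a few lines of arithmetic. This is a sketch only; the function name `sd_of_percentage` is a label invented for the illustration.

```python
import math

def sd_of_percentage(expected_pct, n_tosses):
    """sqrt(P(100 - P)/N), with P in per cent and N the size of each sample."""
    return math.sqrt(expected_pct * (100 - expected_pct) / n_tosses)

n_coins, n_tosses = 7, 100
for heads in (6, 3):
    # Expected percentage from the binomial expansion: 100 * C(7, heads) / 128.
    expected_pct = 100 * math.comb(n_coins, heads) / 2 ** n_coins
    print(f"{heads} heads: expected % = {expected_pct:.2f}, "
          f"expected SD = {sd_of_percentage(expected_pct, n_tosses):.2f}")
```

Although both distributions rest on 100 tosses, the expected standard deviations differ because the expected percentages differ.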

Summarizing the results of the above empirical work, we see
that the means for successive samples tend to distribute
themselves normally about the expected or universe mean with a spread
or standard deviation which is near the standard deviation of the
theoretical distribution of heads divided by the square root of the
size of the samples. We have also seen that percentages distribute
themselves about an expected value with a standard deviation
given by dividing the sample size into the expected percentage
times 100 minus the expected percentage, and then taking the
square root of the result. There is the additional fact that one of
the percentage distributions is skewed.

The student should keep these empirical distributions and
deductions therefrom in mind as we now proceed to a
consideration of what the mathematical statistician says will happen when
successive samples of a given size are drawn from a defined
universe or population or supply. We have seen that a particular
sample, or series of tosses, will not necessarily lead to a statistical
measure which corresponds to the expected value. The amount
of discrepancy will vary from sample to sample (or series to series).
Our task is to be able to specify the possible error or discrepancy
for a statistical measure based upon a sample of N cases.





SAMPLING THEORY 

The discussion here holds for what is known as simple random
sampling. The conditions for simple random sampling are that
the sample should be drawn in such a way that each individual
(person, plant, animal, etc.) in the defined universe shall have an
equal chance for being included in the sample, and that the
drawing of one individual shall in no way affect the drawing of another.
These conditions are not easily met in practice. The difficulties
will be considered later. The aim is, of course, to obtain a sample
which will, within limits of random or chance errors, be
representative of the universe from which it was drawn.

Let

N = the number of cases, or size of sample.

M = the mean of any sample (known, i.e., computed).

$\sigma$ = the SD of any sample (known, i.e., computed).

$\tilde{M}$ = the mean of the universe (unknown).

$\tilde{\sigma}$ = the SD of the universe (unknown).

The $\tilde{M}$ and $\tilde{\sigma}$ values are for the distribution of scores or
measurements for all the individuals in the defined universe. It is not
assumed that this universe distribution is exactly normal; it may
be normal or skewed slightly. Strictly speaking, the number of
cases in the universe should be infinitely large, but failure to meet
this requirement is not serious. As will be seen later, the
adjustment necessary when a sample of N cases is drawn from a limited
or finite universe of $\tilde{N}$ cases is of the order of $N/\tilde{N}$; if it is known
that $\tilde{N}$ is very large relative to N, the formulations about to be
presented will be sufficiently accurate for all practical purposes.

Now suppose we draw a sample of N cases, compute the mean
and standard deviation, then draw another sample of the same
size and compute its mean and standard deviation, and so on until
a large number of samples, say 10,000, have been drawn. We
will then have 10,000 means and 10,000 standard deviations, each
based on N cases. This procedure is termed successive sampling.
When we make a distribution of the 10,000 means and of the
10,000 standard deviations, we have what are called random
sampling distributions. From the viewpoint of mathematical
rigor, the number of successive samples should be much larger
than 10,000, certainly far larger than the 600, or 300, successive
samples of Table 7, in which we have only the beginning of two
random sampling distributions.

By rather complex mathematical methods it can be shown that,
if successive samples of constant size, N, are drawn randomly
from a normally distributed universe or population with mean
equal to $\tilde{M}$ and with standard deviation equal to $\tilde{\sigma}$, the successive
sample means will be normally distributed about $\tilde{M}$, and the
standard deviation of this sampling distribution of means will be
$\tilde{\sigma}/\sqrt{N}$. The random sampling distribution of the successive standard
deviations will center about $\tilde{\sigma}$ (there is a small bias here which need
not concern us at this time). For N large (100 or more) this
distribution will be approximately normal with a standard
deviation equal to $\tilde{\sigma}/\sqrt{2N}$. These mathematical findings have often
been checked empirically. Our Table 7 and the coin tossing of
any class provide limited checks.
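A further limited check can be made by successive sampling on a computer. The sketch below rests on assumptions that are not in the text: a normal universe with the mean of 68 and standard deviation of 2.7 borrowed from the height example, samples of 100, only 2,000 successive samples, and an arbitrary seed. The sample standard deviation is computed with divisor N.

```python
import math
import random
import statistics

pop_mean, pop_sd = 68.0, 2.7    # universe values, borrowed from the height example
n, n_samples = 100, 2000        # sample size; number of successive samples (arbitrary)
rng = random.Random(5)          # arbitrary seed (an assumption of this sketch)

means, sds = [], []
for _ in range(n_samples):
    sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
    means.append(statistics.fmean(sample))
    sds.append(statistics.pstdev(sample))     # sample SD with divisor N

sd_of_means = statistics.pstdev(means)        # spread of the sampling distribution of means
sd_of_sds = statistics.pstdev(sds)            # spread of the sampling distribution of SDs
se_mean_theory = pop_sd / math.sqrt(n)        # universe SD / sqrt(N)
se_sd_theory = pop_sd / math.sqrt(2 * n)      # universe SD / sqrt(2N)

print(f"SD of sample means: {sd_of_means:.3f} (theory {se_mean_theory:.3f})")
print(f"SD of sample SDs:   {sd_of_sds:.3f} (theory {se_sd_theory:.3f})")
print(f"mean of sample SDs: {statistics.fmean(sds):.3f} (universe SD {pop_sd})")
```

With only 2,000 samples the agreement is approximate, and the mean of the sample standard deviations falls slightly below the universe value, which is the small bias mentioned above.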

Standard error. We are now in position to define a term. The
standard error of a statistical measure is the standard deviation
of the sampling distribution of the given measure. The square of
the standard error is called the sampling variance. For the
practical statistician, the sampling distribution is hypothetical, and
hence its standard deviation must be determined by a different
formula from that used for computation from an actual
distribution. The value given by $\tilde{\sigma}/\sqrt{N}$ is called the standard error of
the mean and may be designated as $\tilde{\sigma}_M$. It should be noted that
the sampling distribution may be thought of in terms of standard
measures, analogous to standard scores. Thus each successive
sample mean can be expressed in standard form as $(M - \tilde{M})/\tilde{\sigma}_M$.
These relative deviates will form a normal distribution with mean
of zero and standard deviation of unity. By reference to Table A
(normal curve functions), one can readily specify the chances of
obtaining a sample mean yielding a deviation as great as $M - \tilde{M}$,
providing the value of $\tilde{M}$ is known. But in practical work $\tilde{M}$ is the
unknown about which we desire to make an inference on the basis
of just one sample.

Before resolving this practical problem, we must call attention
to the fact that the universe standard deviation, $\tilde{\sigma}$, needed to
obtain $\tilde{\sigma}_M$ is also an unknown. A single sample will yield a standard
deviation, $\sigma$, which, being a sample value, will of course deviate
more or less from $\tilde{\sigma}$. In order that an inference about $\tilde{M}$ may be
made from a single sample, $\tilde{\sigma}_M$ is estimated by using $\sigma/\sqrt{N}$, i.e.,
the unknown $\tilde{\sigma}$ is replaced by the sample $\sigma$. Instead of the true
value for the standard error of the mean as given by $\tilde{\sigma}/\sqrt{N}$, we
have an approximate value, $\sigma/\sqrt{N}$. Let $\sigma_M$, defined as $\sigma/\sqrt{N}$,
stand for the approximate standard error.

The ignorance concerning $\tilde{\sigma}$, and the consequent approximate
value for the standard error of a given mean, lead to a
reconsideration of the sampling distribution of means. As already pointed
out, the means from successive samples will be distributed
normally, and the relative deviates, $(M - \tilde{M})/\tilde{\sigma}_M$, will likewise be
distributed normally, since $\tilde{\sigma}_M = \tilde{\sigma}/\sqrt{N}$ is a constant. When we
have $\sigma$ instead of $\tilde{\sigma}$ and wish to make an inference about the
universe mean, we need to know something of the sampling behavior
of successive sample means expressed as relative deviates from $\tilde{M}$
where $\sigma_M$ is not a constant but varies from sample to sample,
because the several sample standard deviations vary. Thus the
relative deviate of the first sample mean will be $(M_1 - \tilde{M})$ divided
by $\sigma_1/\sqrt{N}$; for the second sample, $(M_2 - \tilde{M})$ divided by $\sigma_2/\sqrt{N}$;
and so on. The distribution of these relative deviates will not
approximate normality unless N is fairly large. Thus the use of
an estimate of $\tilde{\sigma}$ in determining $\sigma_M$ imposes the restriction that N
shall not be too small. If N is not less than 30, we can safely use
the normal curve as the basis for drawing an inference regarding
the value of $\tilde{M}$. This chapter's discussion of sampling is therefore
not applicable unless N is greater than 30, except in the case of
percentages or proportions, for which a different type of
requirement will be set up. The refinements necessary for N less than
30 will be given in Chapter 12.
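The effect of sample size on these relative deviates can be seen in a small simulation. This is a sketch under stated assumptions: a normal universe with mean 0 and standard deviation 1, a sample standard deviation computed with divisor N, 3,000 successive samples per sample size, and an arbitrary seed. If the relative deviates were normally distributed, about 5 per cent of them would exceed 1.96 in absolute value.

```python
import math
import random
import statistics

def share_beyond(n, n_samples, rng, cutoff=1.96):
    """Proportion of samples whose relative deviate, (M - universe mean)
    divided by (sample SD / sqrt(N)), exceeds cutoff in absolute value."""
    pop_mean, pop_sd = 0.0, 1.0               # assumed normal universe
    count = 0
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m = statistics.fmean(sample)
        sigma = statistics.pstdev(sample)     # sample SD, divisor N (assumed)
        if abs(m - pop_mean) / (sigma / math.sqrt(n)) > cutoff:
            count += 1
    return count / n_samples

rng = random.Random(12)                       # arbitrary seed (an assumption)
shares = {n: share_beyond(n, 3000, rng) for n in (5, 30, 100)}
for n, share in shares.items():
    print(f"N = {n:3d}: proportion beyond 1.96 = {share:.3f}")
```

For very small samples the proportion runs well above .05, and it approaches .05 as N grows, which is why the normal curve is not recommended for small N.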

Inference. We are now ready to proceed to the practical
problem of making an inference from a sample to a universe value. On
the basis of the information yielded by the sample, we are to
make some kind of an inference or estimate concerning the
population constants. This is a very practical problem in connection
with many research projects in the social and biological sciences
and in education. For purposes of illustration, we shall give the
steps necessary to an understanding of the statistical process of
making a deduction concerning a true or population mean.

Suppose that we are desirous of learning something about the
intelligence of high school seniors in a large city. We select 100
high school seniors at random as a sample. Suppose that the
mean IQ of this sample is 114 and the standard deviation is 15;
what can be said concerning the mean of all the high school seniors?
In order to answer this question, we say that, if we were to take
a large number of randomly drawn samples of size 100, and make
a distribution of the several means, these means would be
distributed normally about the population mean, $\tilde{M}$, and that the
standard deviation of the theoretical distribution of means would
be approximately $\sigma/\sqrt{N}$, where $\sigma$ is the standard deviation of
our sample distribution, or 15, and N is the number in the sample,
or 100. The use of $\sigma = 15$ and $N = 100$ gives 1.5 as the standard
deviation of the theoretical distribution of means, but we do not
know what numerical value to assign to $\tilde{M}$, or the population
mean. This is an unknown which can never be known exactly
unless we measure the entire population, but on the basis of the
information yielded by our random sample it is possible to
determine limits between which it is likely that $\tilde{M}$ is located.

Suppose that we consider the location of the sample mean, 114,
on the IQ scale and attempt to determine the likely limits for $\tilde{M}$,
or the population mean. To do this we set up a "trial" hypothesis
regarding the location or value of $\tilde{M}$, and then we decide whether
the observed sample mean is consistent with the hypothesis. If
the observed value, M, is not consistent with the hypothesis, we
must revise the hypothesis; if M is consistent with the hypothesis,
we must ask ourselves about other hypotheses with which M
might also be consistent. Should we discover that the observed
mean is consistent with a number of hypothetical values within
given limits and inconsistent with hypothetical values beyond
these limits, then we have established the limits between which
$\tilde{M}$ is likely located. To say that a sample mean is inconsistent with
a hypothetical value involves the acceptance of an arbitrary
criterion in terms of probability, and the rigidity of the accepted
criterion determines the degree of confidence with which we accept
the limits for $\tilde{M}$.

First, let us suppose that $\tilde{M}$ is 120. This is a proposed
statistical hypothesis which may be conveniently labeled as H120 (see
Fig. 9). If it seems unreasonable to admit that the observed
value, 114, could be a chance deviation from 120, we must
conclude that hypothesis H120 is not borne out by observation, and
consequently we must revise the hypothesis. If 120 were the
population mean, successive sample means would be distributed
about it with a standard deviation of 1.5. Now 114 deviates 6
from 120, and this deviation divided by $\sigma_M$ (1.5) gives an $x/\sigma$ or
z of 4. From Table A it will be found that the probability of a
deviation as great as or greater than 6, in the given direction,
occurring by chance is .00003, from which we conclude that it is
unreasonable to believe that 114 is a chance fluctuation from a
true value for $\tilde{M}$ of 120. We must therefore revise our hypothesis.
Obviously, any hypothetical value for $\tilde{M}$ greater than 120 would
be rejected because the probability of the occurrence of a sample
mean of 114 or less as a deviation from an $\tilde{M}$ greater than 120 is
even smaller than .00003. Hence we would conclude that $\tilde{M}$ is
somewhere below 120. Let us take 118 as hypothesis H118 (see
Fig. 9). The deviation of 114 from 118, divided by 1.5, gives an
$x/\sigma$ of 2.67, and from Table A we note that the probability of
obtaining a deviation that large or larger, in the given direction,
is about .0038. We cannot discard H118 as readily as H120, but
we can be fairly sure that the true or population mean is not as
great as 118. As another statistical hypothesis, let us take 115
(H115) as a possible value for the true mean. The sample value
deviates 1 from this, $x/\sigma = .67$, and the corresponding probability
figure is about .25. From this it can be argued that 114 could
easily be a chance fluctuation from a true value of 115 or from a
somewhat higher value. The observed 114 is not inconsistent with
115, nor is 114 inconsistent with 115.5, 114.5, 113, or other
nearby values.

[Figure: curves labeled H115, H118, and H120.]
Fig. 9. Sampling hypotheses.
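The three trial hypotheses can be worked through numerically. The sketch below replaces the lookup in Table A with the normal integral computed from Python's `math.erfc`; the function name `upper_tail` is an invented label, and the figures come out close to, though not identical with, the rounded table values quoted above.

```python
import math

def upper_tail(z):
    """One-tailed normal probability of a deviate as great as or greater than z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

sample_mean = 114.0
se_mean = 15 / math.sqrt(100)      # sample SD / sqrt(N) = 1.5

probs = {}
for hypothesis in (120, 118, 115):
    z = (hypothesis - sample_mean) / se_mean     # deviation in standard-error units
    probs[hypothesis] = upper_tail(z)
    print(f"H{hypothesis}: z = {z:.2f}, probability = {probs[hypothesis]:.5f}")
```

The smaller the probability, the less reasonable it is to regard 114 as a chance deviation from the hypothesized value.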

Confidence limits and confidence intervals. In order to
decide more definitely on a value which can be considered the
upper limit for the location of $\tilde{M}$, it is necessary that we agree on
what is meant by "reasonably" sure. One criterion accepted by
some is that the limit can be taken as the point which is 1.96
sigma units from the observed mean. Thus 116.94 would be the
upper limit in the present problem. If 116.94 were the
population mean, the chance of obtaining a sample value as low as 114
would correspond to the probability figure .025, and of course
the probability of 114 as a chance deviation from an $\tilde{M}$ above
116.94 would be still smaller. To determine the lower limit for $\tilde{M}$
we could again go through the process of setting up a series of
trial hypotheses, but once we have accepted the arbitrary criterion
given above, it is necessary only to subtract 1.96 times 1.5 from
114. The resulting 111.06 is the lower limit. Thus in practice
all we need to do is to take $M \pm 1.96\sigma_M$ as limits for $\tilde{M}$. With
what level of confidence can we assert that the population mean
is between the limits indicated by $M \pm 1.96\sigma_M$? Note again
that the probability of obtaining a sample M as low as 114 as a
chance deviation from 116.94 (or $M + 1.96\sigma_M$) is .025 and that
the probability for getting a sample value as high as 114 as a
deviation from 111.06 (or $M - 1.96\sigma_M$) is also .025. Since these two
probabilities add to .05, it might seem on first thought that the
probability is .95 that the population mean lies between the limits
$M \pm 1.96\sigma_M$. However, this would be incorrect because there is
just one value for the population mean, so that it is impossible
to have a distribution of population values about the available
sample value. Note that, when it is said that the probability is
.95 that a sample mean will fall between the points $\tilde{M} - 1.96\sigma_M$
and $\tilde{M} + 1.96\sigma_M$, we are basing our statement on the fact that
it is possible to have a large number of sample means distributed
(normally) about the population mean.

What type of probability statement can we make concerning
the population value? It seems obvious that there should be
some way of expressing our degree of confidence that the
population mean lies between the limits $M \pm 1.96\sigma_M$, since, as we have
seen, we can be somewhat sure that the sample mean is not a
chance deviation from a population mean outside the limits so
determined.

In order to arrive at a confidence statement, we note that, if
we drew a second sample, we would be apt to have a different set
of limits for the simple reason that the second sample mean may
differ from the first. If we took additional samples of the same
size, we would have a distribution of sample means, hence a sort
of distribution of sets or pairs of limits, since each sample mean
would provide a set. Our discussion can be greatly simplified
by taking sets of limits given by $M \pm 2\sigma_M$ (as approximating the
$M \pm 1.96\sigma_M$ values). For simplicity of exposition, let us assume
that we are drawing successive samples from a population having
a mean of 10, and that the variability and N are such that $\sigma_M$ can
be taken as 2. Then $M \pm 2\sigma_M$ will be $M \pm 2(2)$, or $M \pm 4$. It
will also facilitate our exposition if we think of the random
sampling distribution of means in terms of intervals of distances
on the base line with the approximate percentage area for the
several intervals, as shown in the top curve of Fig. 10.

[Figure: top curve, the sampling distribution of means, with the LL
curve and the UL curve below it, on a base line running from 0 to 20.]
Fig. 10. Generation of confidence limits.

Now each possible sample mean will lead to a lower limit of
$M - 4$ and an upper limit of $M + 4$. If we consider the
19 per cent of sample means expected between 9 and 10, we see
at once that these 19 will lead to lower limits of 5 to 6 and to
upper limits of 13 to 14. That is, the sample means falling
between 9 and 10 will generate that part of the lower limit (LL)
curve of Fig. 10 between 5 and 6 and that part of the upper limit
(UL) curve between 13 and 14. Likewise the 15 per cent of
sample means falling between 8 and 9 will lead to the 4 to 5 part of
the LL curve and to the 12 to 13 part of the UL curve. Similarly,
as can be seen by careful study of the three curves of Fig. 10,
every left-hand segment of the top curve generates a left-hand
segment for each of the bottom curves. Stated differently, the
left half of the top curve leads to a distribution of lower limits of
less than 6 and of upper limits of less than 14, and the left-hand
segmental frequencies of the bottom curves depend upon the left
half of the top curve, or upon half the curve for the sampling
distribution of means. In exactly the same fashion it can be seen
that the right half of the top curve leads to the right half of the
LL curve and also to the right half of the UL curve. Thus we
arrive at a sampling distribution of limits as found by taking
$M \pm 4$ (or $M \pm 2\sigma_M$). Our next task is to ask how many of these
various sets of limits actually include 10, or the population mean.
Reference to Fig. 10 will verify that, out of 100 tries, we would
expect to get:


   4 times an LL of 2 to  3 and a UL of 10 to 11
   9 times an LL of 3 to  4 and a UL of 11 to 12
  15 times an LL of 4 to  5 and a UL of 12 to 13
  19 times an LL of 5 to  6 and a UL of 13 to 14
  19 times an LL of 6 to  7 and a UL of 14 to 15
  15 times an LL of 7 to  8 and a UL of 15 to 16
   9 times an LL of 8 to  9 and a UL of 16 to 17
   4 times an LL of 9 to 10 and a UL of 17 to 18


Notice that for every set in the foregoing groups the population
mean is in the range or interval defined by the upper and lower
limits of the set. When we sum the expected frequencies, we see
that 94 per cent of the sets of limits lead to intervals within which
the population mean lies. If we had not rounded to the nearest
per cent, these would sum to 95.45 per cent. This implies that
4.55 per cent of the times the intervals so defined would not
include the population value. This can be verified by noting that
sample means of less than 6 (top curve) lead to upper limits of
less than 10, and do so 2.27 per cent of the times, whereas sample
means of more than 14 produce lower limits of more than 10 about
2.27 per cent of the times. These percentages are for the tails of
the bottom curves, to the left of the ordinate at 10 for the UL
curve and to the right of this ordinate for the LL curve.
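The expected frequencies just listed, and the unrounded total, can be recomputed from the normal integral. The following is a minimal sketch in which the helper `phi` (the cumulative normal proportion, built on `math.erf`) and the variable names are assumptions made for the illustration; the population mean of 10 and standard error of 2 are those of the example.

```python
import math

def phi(z):
    """Proportion of the normal curve lying below the deviate z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pop_mean, se_mean = 10.0, 2.0
half_width = 2 * se_mean           # limits are taken as M - 4 and M + 4

rows = []
for lo in range(6, 14):            # unit intervals of sample means: 6-7, 7-8, ..., 13-14
    hi = lo + 1
    area = phi((hi - pop_mean) / se_mean) - phi((lo - pop_mean) / se_mean)
    rows.append((round(100 * area), lo - half_width, hi - half_width,
                 lo + half_width, hi + half_width))

for pct, ll_lo, ll_hi, ul_lo, ul_hi in rows:
    print(f"{pct:2d} times an LL of {ll_lo:g} to {ll_hi:g} "
          f"and a UL of {ul_lo:g} to {ul_hi:g}")

rounded_total = sum(row[0] for row in rows)
coverage = phi(2) - phi(-2)        # sample means within 2 standard errors of 10
print(f"rounded total = {rounded_total}, unrounded = {100 * coverage:.2f} per cent")
```

Every interval of sample means between 6 and 14 yields limits that enclose 10; only means falling in the two tails beyond those points yield limits that miss it.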

In summary, if one were to make in his lifetime 100 inferences
concerning population means on the basis of sample values by
each time taking the limits as $M \pm 2\sigma_M$, the limits so established
would include the population value about 95 per cent of the tries.
That is, in the long run he would be correct about 95 per cent of
the time in concluding that the population value is within the
limits so determined, and about 5 per cent of the time he would
be in error. If he used $M \pm 1.96\sigma_M$ for setting limits, he would
be correct 95 per cent, and in error 5 per cent, of the time. When
we take $M \pm 1.96\sigma_M$ as confidence limits, the degree of faith in
such limits is represented by 95 per cent, or by a P of .95; i.e.,
the level of confidence for such an inference is represented by a
probability-type figure of .95. If we wish to be surer of our
inferences, we might choose the .99 confidence level. This, in practice,
can be attained by taking $M \pm 2.58\sigma_M$ as limits.
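For the sample of high school seniors (mean 114, standard deviation 15, N of 100) the two sets of limits can be computed as follows. The function name `confidence_limits` is an invented label, and the sketch assumes, as the chapter does, that N is large enough for the normal curve to apply.

```python
import math

def confidence_limits(mean, sd, n, z):
    """Limits M - z*SE and M + z*SE, where SE = sample SD / sqrt(N)."""
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se

limits = {}
for level, z in ((".95", 1.96), (".99", 2.58)):
    limits[level] = confidence_limits(114, 15, 100, z)
    lower, upper = limits[level]
    print(f"{level} confidence limits: {lower:.2f} and {upper:.2f}")
```

The .99 limits are wider than the .95 limits: greater confidence is bought at the price of a less definite statement about the population mean.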

Confidence limits, which are sometimes said to define a
confidence interval, are in their simpler applications very similar to
fiducial limits. Both are methods for setting limits, and if the
student should happen upon an elementary discussion of fiducial
limits he might be inclined to believe that fiducial limits are
nothing more than confidence limits under another name. There are,
regardless of the similarity of outcome, basic differences between
these two concepts.*

Our discussion of confidence limits should help the student
understand and appreciate the type of inference that is
permissible regarding a population value. It should be obvious that,
since $\sigma_M$ is a function of N, it is possible, by increasing N, to
narrow the confidence limits without any loss in the degree of
confidence with which we accept the limits. As we shall presently
see, statistical inference is not restricted to the setting of
confidence limits for population values; we can test experimental
hypotheses, which is far more useful for most researches in psychology.


STANDARD ERRORS 

The symbol $\sigma_M$ is to be read as "the sigma of the mean" or "the
standard error of the mean"; and, as previously stated, $\sigma_M$
approximates the standard deviation of a distribution of means from
successive samples, i.e., the standard deviation of a hypothetical
distribution rather than of an observed distribution. Instead of
using the term standard deviation in connection with a sampling
or hypothetical distribution, it is customary to use the term
standard error, the implication being that the standard error tells us
something about the possible magnitude of sampling or chance
errors. Ordinarily the subscript of $\sigma$ defines a sigma as being the
standard error of some descriptive term. Thus $\sigma_Q$ would be read
as "the standard error of the quartile deviation."

* See Chapters 19 and 20 of Kendall, M. G., The advanced theory of statistics,
London: Charles Griffin and Co., 1946 (Vol. II).

So far the concept of sampling errors has been discussed and
interpreted in connection with the mean. The other descriptive
terms, defined and considered in Chapter 3, also show chance
fluctuations from sample to sample. The extent of their sampling
errors can be determined by use of proper standard-error formulas.
The standard error of the mean has already been given as

$$\sigma_M = \frac{\sigma}{\sqrt{N}} \tag{18}$$

and the error of the median for normal distributions is

$$\sigma_{mdn} = \frac{1.253\,\sigma}{\sqrt{N}} \tag{19}$$

A comparison of the standard error of the mean with that of the 
median indicates that the mean fluctuates less than the median; 
i.e., the mean is a more stable measure of central value than the 
median. In order to reduce the standard error of the median to 
the same magnitude as that of the mean it is necessary to take 
57 per cent more cases, i.e., increase N by 57 per cent. It follows 
from this that the use of the median for distributions which are 
reasonably normal in form is equivalent to throwing away a large 
proportion of the cases. 
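The figure of 57 per cent follows from squaring the ratio of the two standard errors. In the sketch below the constant 1.253 is treated as the rounded value of the square root of pi/2, which is the usual large-sample result for the median of a normal distribution; that identification is supplied here and is not stated in the text.

```python
import math

ratio = math.sqrt(math.pi / 2)     # SE of median / SE of mean, about 1.253
# To make the median's standard error equal the mean's, sqrt(N) must grow
# by this ratio, so N itself must grow by the ratio squared.
extra_cases = ratio ** 2 - 1

print(f"ratio of standard errors = {ratio:.3f}")
print(f"additional cases needed  = {100 * extra_cases:.0f} per cent")
```

Stated the other way round, a median based on N cases is about as stable as a mean based on roughly 64 per cent of N cases (1/1.57).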

For normally distributed variables, the sampling errors involved 
in measures of dispersion are 


(T .707cr _ 

~Vn~ 

.756 AD 

Vn 

1.166 Q 


( 20 ) 

( 21 ) 


(22) 



Attributes 


61 


From these error formulas it will be seen that, considering the
error relative to the magnitude of the measures of dispersion,
$\sigma$ is the most stable measure of variation. Providing N is 100 or
more, the sampling distributions for these measures of dispersion
are such that their standard errors can be interpreted in exactly
the same way as the standard error of the mean. Thus the .99
confidence limits within which the population standard deviation,
$\tilde{\sigma}$, is apt to be located will be given by $\sigma \pm 2.58\sigma_\sigma$; for example,
for $\tilde{\sigma}$ of the IQ's of high school seniors, we would have
15 ± 2.58(1.06), or 12.27 and 17.73 as limits.
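
The limits just quoted can be checked with a few lines of Python. This is only a sketch: the function names are ours, and N = 100 is an assumption inferred from the quoted standard error of 1.06, since the text gives only the sigma of 15 at this point.

```python
import math

def se_sigma(sigma, n):
    """Standard error of a standard deviation, formula (20)."""
    return 0.707 * sigma / math.sqrt(n)

def confidence_limits(value, se, z=2.58):
    """Limits value - z*se and value + z*se; z = 2.58 gives the .99 limits."""
    return value - z * se, value + z * se

# sigma = 15 is from the text; N = 100 is an assumption inferred from
# the quoted standard error of 1.06.
sigma, n = 15.0, 100
se = round(se_sigma(sigma, n), 2)
lower, upper = confidence_limits(sigma, se)

print(se, round(lower, 2), round(upper, 2))
```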

The standard errors for measures of skewness and kurtosis are

$$\sigma_{g_1} = \sqrt{\frac{6}{N}} \tag{23}$$

$$\sigma_{g_2} = \sqrt{\frac{24}{N}} \tag{24}$$



These two formulas are based on the assumption that the sample 
has been drawn from a normally distributed population, and 
therefore they can be legitimately used in testing the assumption 
of normality. It will be recalled that, for normal distributions,
both $g_1$ and $g_2$ are equal to zero, but for a sample they may not
be zero; however, sample values should not show a greater deviation
from zero than can be reasonably attributed to chance. If a
sample yields a $g_1$ value which is more than, say, 2.58 times its
sampling error, one would suspect that the sample was not drawn
from a symmetrically distributed supply. Likewise, if $g_2$ deviates
from zero more than 2.58 times its standard error, one would
question whether it is reasonable to believe that the population 
or supply is distributed with normal kurtosis. These are hints 
as to how hypotheses may be checked, a topic soon to be discussed 
in detail. 

Attributes. The statistical measures so far discussed are of use 
in describing frequency distributions for graduated variables, but 
it often happens that the research worker can classify individuals 
only on the basis of the presence or absence of a certain charac¬ 
teristic, which may be qualitative, or quantitative but not meas¬ 
urable in a graduated manner. When individuals are classified 
into categories on the basis of some characteristic or attribute,
it is usually desirable to reduce the frequencies to percentages.
Thus we might observe that 40 per cent of 200 boys succeed in 
solving an arithmetical problem, or that 65 per cent of 100 white 
rats seem to prefer to go to the right rather than to the left when 
given a choice, or that 60 per cent of 300 delinquents come from 
broken homes. In each of the above examples, the given percent¬ 
age is based on a sample, presumably random, of a defined popula¬ 
tion, and again we are faced with the problem of making an 
inference from the sample value to the population value, i.e., 
from $P$ to $\tilde{P}$, where $P$ stands for the observed percentage and $\tilde{P}$
for the percentage of the defined population who show the characteristic.
If we were to take successive samples of size $N$ and
make a distribution of the observed percentages, the distribution
would center about $\tilde{P}$ with a spread or standard deviation
equal to the square root of $\tilde{P}(100 - \tilde{P})/N$. Since we seldom
know $\tilde{P}$, we must use the observed percentage as a basis for
determining its standard error. The standard error of a percentage
will be given approximately by

$$\sigma_P = \sqrt{\frac{PQ}{N}} \tag{25}$$

in which P = the sample or observed percentage, 

Q = 100 - P, 

N = the number of cases upon which P is based. 

We have already seen (Table 8) that the sampling distribution 
of percentages is skewed when P is small and N = 100. The same 
thing is true when $P$ is large ($Q$ must accordingly be small). This
skewness is partly a function of the extremeness of $P$ (or $Q$) and
partly a function of the size of the sample. Rather than state
that $\sigma_P$ and the usual interpretation based on the normal curve
are inapplicable for percentages more extreme than, say, 90 (and
10) or 95 (and 5) or some other critical value, for such and such
N's, it is necessary only to require that the number of cases in
the smaller of the two categories be 5 or greater. However, it is
not safe to use $\sigma_P$ for setting confidence limits for population
values when extreme percentages are involved. 

If 40 of 200 individuals chosen at random from a defined population
have had measles, we infer that the population percentage,
$\tilde{P}$, is likely to be between the limits $P \pm 2.58\sigma_P$, or $20 \pm 2.58(2.8)$.
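
The measles example can be worked through in Python as a rough check. The standard error is taken as the square root of PQ/N, with Q = 100 - P, as described above; the function and variable names are illustrative only.

```python
import math

def se_percentage(p, n):
    """Standard error of a percentage: sqrt(P * Q / N), with Q = 100 - P."""
    return math.sqrt(p * (100 - p) / n)

count, n = 40, 200                 # 40 of 200 individuals have had measles
p = 100 * count / n                # observed percentage
se = se_percentage(p, n)           # about 2.8
lower = p - 2.58 * se              # lower .99 limit
upper = p + 2.58 * se              # upper .99 limit

# Rule of thumb from the text: the smaller of the two categories
# should contain 5 or more cases before the normal curve is relied upon.
smaller_category = min(count, n - count)
normal_curve_usable = smaller_category >= 5

print(round(se, 1), round(lower, 1), round(upper, 1), normal_curve_usable)
```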





The standard error of a percentage can also be used to determine 
whether an observed percentage can be thought of as a chance 
deviation from some logically preassigned value. Thus the 
geneticist might expect on the basis of some genetical principles 
that 25 per cent of the offspring of a certain crossing would show 
a particular characteristic. He observes that 27 per cent of a 
sample of 300 show the characteristic. The deviation of 27 from 
25 divided by the standard error, 2.5, yields an $x/\sigma$ of less than 1,
from which it would be concluded that 27 might be a chance fluctuation
from the expected percentage. If the observed percentage
had been 35, the $x/\sigma$ of 4 would indicate that the geneticist's
observation was contrary to his original hypothesis. This is
another example of the yet to be discussed use of statistics in 
checking hypotheses. 
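
The geneticist's comparison can be sketched as follows. One assumption should be stated plainly: the standard error is computed here from the expected percentage (25), which reproduces the 2.5 of the text; the function name is ours.

```python
import math

def critical_ratio_for_percentage(observed, expected, n):
    """Deviation of an observed percentage from a preassigned one, in sigma units.

    The standard error is based on the expected percentage, which is an
    assumption on our part; it reproduces the 2.5 quoted in the text.
    """
    se = math.sqrt(expected * (100 - expected) / n)
    return (observed - expected) / se, se

ratio_27, se = critical_ratio_for_percentage(27, 25, 300)
ratio_35, _ = critical_ratio_for_percentage(35, 25, 300)

print(se, round(ratio_27, 2), round(ratio_35, 2))
```

Had the observed 27 per cent been used instead, the standard error would come to about 2.6 and the conclusions would be unchanged.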

The student will have noted that proportions, instead of percentages,
can be evaluated by means of formula (25). Let $p$ =
given proportion of cases, and $q = 1 - p$; then

$$\sigma_p = \sqrt{\frac{pq}{N}} \tag{25a}$$

COMPARISON OF GROUPS 

One of the foremost problems in practical statistics is the comparison
of group trends. We may wonder whether one racial
group is superior to another, whether practice on a task increases
mean performance, whether rats learn more rapidly when food,
instead of water, is used as the incentive, whether reaction time
to sound is faster than to light, whether the sexes show a difference
in variational tendency, whether a larger proportion of
boys than of girls prefer the movies to a church festival, etc. In
order to answer questions like the above, it is necessary that we
make observations on samples from two groups or on the same
group under two different experimental conditions, and then compute
the sample means, medians, $\sigma$'s, or percentages, as the case
may be, for the variable or characteristic upon which we wish to
make the comparison. Thus, typically, we have two samples or
two sets of scores or measures based on $N_1$ and $N_2$ cases, with
means $M_1$ and $M_2$, and sigmas $\sigma_1$ and $\sigma_2$, where the subscripts
refer to the two different sets of measures. We have learned that
each mean is subject to sampling fluctuations; therefore in order
to say whether a real difference exists between two population
means, $\tilde{M}_1$ and $\tilde{M}_2$, it is necessary to determine how large a difference
could arise solely as a result of sampling errors.

Sampling distribution of differences. It has already been
stated that successive sample means computed from normal or
nearly normal distributions will be distributed normally about
the population value, $\tilde{M}$, with a standard deviation of $\sigma/\sqrt{N}$.
It can be shown that, if we have means based on normal or nearly
normal distributions, the differences between successive sample
means will also be distributed normally, and in particular that,
if we take successive samples from two defined populations and
each time determine the difference between the two sample means,
$D_M = M_1 - M_2$, these differences between means will be distributed
normally about $\tilde{D}_M$ ($= \tilde{M}_1 - \tilde{M}_2$). The dispersion of
the distribution will be approximately

$$\sigma_{D_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2 - 2r\,\sigma_{M_1}\sigma_{M_2}} \tag{26}$$

in which r is a measure of the relationship between the two means 
when the means are not based on independent samples. Lack of 
independence would be the case if we had a mean initial score, 
followed by an experience, and then a mean final score for the 
same individuals; or, if in order to compare the performance of 
boys and girls, we chose the boys at random and then selected as 
our female group only those girls who had brothers in the male 
group; or if we chose at random individuals for an experimental 
group and then formed a control group by selecting only indi¬ 
viduals who could be paired on some basis with individuals in 
the experimental group; or if we had the same subjects working 
under two different experimental conditions. It will be noticed 
that we can have a correlational term in formula (26) only when 
it is possible to pair, on some basis other than chance, the scores 
which enter into the first mean with the scores which contribute 
to the second mean. If this is impossible, the correlation, $r$, becomes
zero when each score is paired with every other score, and
the third term under the radical in (26) vanishes. 

If, for instance, we were to have the mean IQ for 200 Buffalo 
10-year-old boys and for 300 Los Angeles boys of the same age, 
there would be no way of pairing scores except at random. In 
this case the means are based on samples which have been drawn
independently of each other, and under this condition the standard
deviation of the distribution of differences between means
becomes

$$\sigma_{D_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2} = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} \tag{26a}$$

Formulas (26) and (26a) define the standard error of the difference 
between means for two distinctly different situations. Formula 
(26) is for the case in which the two means are based on scores or 
measures which are somehow related to each other, whereas 
(26a) is applicable when the two samples are absolutely independent.
Since (26) implies paired scores, it follows that $N_1 = N_2$,
whereas (26a) is not so limited. A method will be given later by
which one can calculate the value of $\sigma_{D_M}$ given by (26), without
knowledge of how to calculate $r$.
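
The two situations can be set side by side in a short Python sketch. The functions follow the verbal pattern of formulas (26) and (26a): the two sampling variances are added, and for related samples twice the product of the two standard errors times r is subtracted. All numerical values below are hypothetical; the text supplies only the sample sizes of 200 and 300 for the Buffalo and Los Angeles boys.

```python
import math

def se_mean(sd, n):
    """Standard error of a mean, formula (18)."""
    return sd / math.sqrt(n)

def se_diff_related(se1, se2, r):
    """Formula (26): the two means rest on paired (related) scores."""
    return math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)

def se_diff_independent(se1, se2):
    """Formula (26a): independent samples, so the r term vanishes."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical figures: a sigma of 15 is assumed for each city.
se_buffalo = se_mean(15.0, 200)
se_los_angeles = se_mean(15.0, 300)
independent = se_diff_independent(se_buffalo, se_los_angeles)

# For paired data (equal N), a positive r shrinks the standard error.
se_a = se_mean(15.0, 200)
paired = se_diff_related(se_a, se_a, r=0.80)    # r = .80 is hypothetical
unpaired = se_diff_related(se_a, se_a, r=0.0)   # identical to formula (26a)

print(round(independent, 2), round(paired, 2), round(unpaired, 2))
```

Setting r to zero in the first function gives the same value as the second, which is the sense in which (26a) is a special case of (26).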

When from the context it is clear what sample values are being
compared by way of the standard error of their difference, it is
convenient to use the symbol $\sigma_D$ instead of the more cumbersome
$\sigma_{D_M}$. As we shall soon see, one can have a $\sigma_D$ for medians, for
measures of variation, for percentages, and for other statistical
measures.

The interpretation of $\sigma_D$ is the same as that of $\sigma_M$ except that
$\sigma_D$ is the standard deviation of a hypothetical distribution of differences.
Given a $D_M$, we can use $\sigma_D$ to establish confidence
limits for the unknown difference between the two universe means,
i.e., for $\tilde{D}_M = \tilde{M}_1 - \tilde{M}_2$. If these limits include zero on the scale
of differences, one can infer that the obtained difference could be
a chance departure from a $\tilde{D}_M$ of zero, i.e., that the found difference
is not statistically significant.

The null hypothesis. A better rule-of-thumb procedure for
testing the statistical significance of a difference is to set up the
null hypothesis that there is no difference between the two population
or universe measures (parameters). Then the argument
runs as follows: if there is no difference between, say, $\tilde{M}_1$ and $\tilde{M}_2$,
and if we were to repeat the investigation by drawing successive
samples from the two universes, each time determining
$D_M = M_1 - M_2$, these differences would be distributed normally
about $\tilde{D}_M = 0$, with a standard deviation of $\sigma_{D_M}$. (This $\sigma_D$ in
practice must be based on information derived from a single sample
from each of the two universes, hence is an approximation.)
By reference to Table A, we can specify the probability of securing,
by chance sampling, a difference as large as any given value.
In particular, we can think of an obtained difference as a deviation
from zero, express this as an $x/\sigma$ by taking $D/\sigma_D$, and then
ascertain how often as large a difference, irrespective of direction,
would occur as a chance deviation from zero or no difference. If
$D/\sigma_D$ is very large, say 4 or more, the probability of its occurrence
by chance is so small that the null hypothesis would be rejected,
and its rejection implies that we are justified in believing a real
difference exists between the two universe means. In other words,
if it is not reasonable to think that an observed difference is a
chance variation from zero, we conclude that it is a variation from
some value other than zero, i.e., that $\tilde{D}_M$ is greater than zero and
in the direction indicated by $D_M$.

How large should $D/\sigma_D$, sometimes called the critical ratio
(CR), be before the null hypothesis is rejected? There is no one
answer to this question, although usage and convention would
have us believe, for example, that a CR of 3.0 indicates statistical
significance, whereas one of 2.9 does not, or that a CR of 2.0
justifies the pronouncement that a difference has been established,
whereas 1.9 does not permit such a statement. When one refers
to the normal table (Table A), he sees that a difference as large
as 3.0 times $\sigma_D$ will occur by chance with a probability of about .003, and that
one as large as 2.9 times $\sigma_D$ will occur by chance with a probability of about .004.
Although the risk of wrongly concluding that a difference exists,
when in reality there is no difference, has increased by .001, this
is hardly justification for rejecting a CR of 2.9 by demanding a
value as high as 3.0 as a criterion of significance. Likewise, the
difference between a CR of 2.0 and one of 1.9 is so small (P's or
probabilities of .045 and .057 respectively) that one begins to
suspect that criteria as to what is significant and what is not are
arbitrarily specified.
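
The probabilities read from the normal table can be approximated in Python with the complementary error function of the standard library. This is a sketch for checking the figures, not a replacement for Table A; the function name is ours.

```python
import math

def two_tailed_p(cr):
    """Probability of a deviation at least |cr| sigmas from zero,
    irrespective of direction, under the normal curve."""
    return math.erfc(abs(cr) / math.sqrt(2))

# Critical ratios discussed in the text.
for cr in (1.9, 2.0, 2.9, 3.0):
    print(cr, round(two_tailed_p(cr), 4))
```

The computed values agree with the tabled ones to within rounding in the third decimal place.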

When one sets the null hypothesis and considers the probability
of obtaining a CR as large as 2.0 or 2.6 or 3.0, it is obvious that
these CR's represent varying degrees or levels of significance. A
CR of 2.0 (more exactly 1.96 for large N's) is associated with a
significance level represented by a P of .05; i.e., a researcher would
announce 5 experiments out of every 100 as showing real differences
even though no actual differences existed between the universes
involved in his comparisons. Such erroneous conclusions
would be made once in 100 (P = .01) if a $D/\sigma_D$ of 2.6 (more exactly
2.576 for large N's) were accepted as the criterion of significance,
whereas a significance level represented by a P of about .003 goes
with a CR of 3.0. This last level is obviously a rigorous one, and a
strict adherence thereto means that one can be reasonably sure
of the dependability of the conclusion that a real difference exists.

There are times when one might desire the assurance represented
by a P of .003 (CR of 3.0), but it should be noted that the
acceptance of the null hypothesis because CR does not reach 3.0
may lead too frequently to another type of erroneous conclusion.
To understand this, we must consider what it means when the
observed difference does not lead to a rejection of the null hypothesis.
Acceptance of the null hypothesis does not prove that no
difference exists. For example, a difference of .5 inch, in the mean
height for two groups or samples, which yields a CR of .8, does
not prove that there is no difference in the two universe values;
it merely indicates that the real difference could easily be zero.
But the obtained difference of .5 could be a chance departure
from a real difference of .2 or .6 or .8 or any of a whole series of
values near .5. In other words, the null hypothesis is one which
can be rejected but can never be proved; therefore to accept it
too often because we insist on a high level of significance for rejection
means that we are too apt to overlook real differences. This,
plus the fact that we don't ordinarily need the assurance represented
by a significance level of .003, would suggest that a CR of
3.0, which some have demanded as the arbitrary criterion of
significance, is too high.

At the other extreme, a few are willing to accept as significant 
a difference which is 1.5 times its standard error. Since P = .13 
for a CR of 1.5, it is readily seen that such persons would all too 
frequently have their publics believing that chance differences are 
real. A less lax level, which has had general acceptance by some 
workers, is represented by a P of .05, or a CR of nearly 2.0. This 
is also a rather low level of significance for announcing something 
as "fact." Those writers who advocate the .05 level for research
workers in psychology, sociology, and education cite R. A. Fisher,
the world's leading statistician, as their authority, but they fail
to point out that Fisher's applications are to experimental situations
wherein there is far better control of sampling than is ordinarily
the case in the social sciences.




Furthermore, one might be willing to tolerate a rather low level 
of significance for those research areas or disciplines wherein the 
exact repetition of investigations for independent verification of 
findings is the rule, yet hesitate to accept the same level in psy¬ 
chology and the other social sciences wherein, unfortunately, 
repetition is not the order of the day, either because it is too costly 
or because it is too routine to interest those who have been indoctrinated
with the idea that no study is worth while unless it gives
one a chance to show "originality." If there is little likelihood
of independent verification and if the findings of a study are to 
be used as the basis either for theory and further hypotheses or 
for social action, it does not seem unreasonable to require a higher 
level of significance than that which goes with a P of .05. We 
hasten to add the all-important point that there are factors other 
than statistical probability which need to be considered before a 
finding is accepted as having been established. Statistical method 
does not obviate the necessity for adequate controls in an investi¬ 
gation. Indeed, the statistical method presumes that a given 
batch of data has been collected in a dependable manner. 

The answer as to how large a CR should be, or what level in 
terms of probability should be adopted, in order to call a finding 
statistically significant is thus seen to be quite involved. There 
is the balancing of risks: that of accepting the null hypothesis 
when to do so may mean the overlooking of a real difference against 
that of rejecting the null hypothesis when doing so may lead to 
the acceptance of a chance difference as real. There is the ques¬ 
tion of the likelihood of independent verification, and, finally, 
there is the whim of personal preference: some individuals are 
more eager than others to announce a positive finding, i.e., a dif¬ 
ference as opposed to no difference, which is referred to as a nega¬ 
tive finding, whereas some prefer to be more conservative about 
drawing positive conclusions. It follows that no hard and fast 
rule can be given beyond that of interpreting a given finding in 
terms of the probability of its occurrence by chance and then not¬ 
ing whether the P is near the significance level which seems appro¬ 
priate when all factors are weighed. Since significance levels are 
on a sliding scale, there is nothing magic about a particular cri¬ 
terion of significance. 

The reader will have noted from the foregoing that the testing 
of hypotheses involves the possibility of two types of erroneous
conclusions. These possible errors are usually referred to as
type I and type II errors, which we shall now more specifically 
define. Consider again the null hypothesis that no difference 
exists between the means for two populations. If we reject this 
hypothesis when in fact it is true, we will have committed a type 
I error. If we accept the hypothesis when in fact it is false, we 
will have made a type II error. It should be obvious that the use 
of a lax instead of a stringent level of significance will tend to in¬ 
crease the possibility of making a type I error, whereas the use of 
a stringent level will increase the risk of making a type II error. 
When deciding whether to favor one risk over the other, one must 
consider from all possible angles the consequences involved in 
making an erroneous inference. 
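
The trade-off between the two risks can be illustrated numerically. The sketch below rests on assumptions of our own: the critical ratio is treated as normally distributed with unit standard deviation about the true difference expressed in units of its standard error, and that true value (2.5) is hypothetical.

```python
import math

def normal_cdf(z):
    """Area under the unit normal curve below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def type_i_risk(criterion):
    """Chance that |CR| reaches the criterion when the true difference is zero."""
    return 2 * (1 - normal_cdf(criterion))

def type_ii_risk(criterion, true_ratio):
    """Chance that |CR| falls short of the criterion when the true difference,
    in units of its standard error, equals true_ratio."""
    return normal_cdf(criterion - true_ratio) - normal_cdf(-criterion - true_ratio)

true_ratio = 2.5   # hypothetical: the real difference is 2.5 standard errors
risks = [(c, type_i_risk(c), type_ii_risk(c, true_ratio))
         for c in (1.96, 2.58, 3.0)]

for criterion, alpha, beta in risks:
    print(criterion, round(alpha, 3), round(beta, 2))
```

As the criterion is raised from 1.96 to 3.0, the first risk falls while the second rises, which is the balancing of risks described above.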

If some reader must have a criterion regarding what is or is not
significant, the author suggests that he compromise by taking
the level indicated by a P of .01 (or a CR of 2.58). One way out
of the difficulty, so far as verbalization is concerned, is to say that
a difference is significant at the .05, the .02, the .01, the .001, or
whatever level it reaches. Since not all tests of statistical significance
lead to CR's or to $(x/\sigma)$'s interpretable from the normal
curve, the habit of thinking in terms of significance levels, i.e.,
P's, rather than CR's should be cultivated.

The principles involved in using and interpreting the standard 
error of the difference between means are applicable to a variety 
of statistical comparisons, such as the differences between medians, 
between standard deviations, and between percentages or pro¬ 
portions. The general pattern for the standard error of the dif¬ 
ference between two statistical measures based on two samples 
which are not independent is that of formula (26) with the subscript
$M$ changed to $mdn$ or $\sigma$ or $p$ or % or whatever symbol is
appropriate for the measures to be compared. It will be noted
that the $\sigma_D$ for any two measures involves the square root of an
expression which is the sampling variance of the given measure
for the first group or set, plus the sampling variance of the given
measure for the second set, minus twice the product of the two
standard errors times an $r$. The exact value for the needed $r$ is
not known for certain comparisons, but for means it is known
to be the correlation between pairs and for standard deviations
it is the square of the correlation between pairs. The formula,
usable when $N$ is greater than 100, for the standard error of the
difference between two sample standard deviations is

$$\sigma_{D_\sigma} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2 - 2r_{12}^2\,\sigma_{\sigma_1}\sigma_{\sigma_2}}$$

for which one needs to know $r$ or how to compute it, there being no
alternate formula such as that (to be presented shortly) for determining
the value of $\sigma_{D_M}$ of formula (26) without $r$.

The value of the standard error of the difference between
standard deviations based on large ($N$ greater than 100) independent
samples is given by

$$\sigma_{D_\sigma} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2} = .707\,\sigma_{D_M}$$

Also, for independent samples,

$$\sigma_{D_{mdn}} = \sqrt{\sigma_{mdn_1}^2 + \sigma_{mdn_2}^2}$$

and expressions for $\sigma_{D_{AD}}$ and $\sigma_{D_Q}$ are similarly written.

The comparison of medians, average deviations, or $Q$'s is not
accurately possible for nonindependent samples because the proper
value for the correlation in the $r$ term is unknown.

As an example of the use of the standard error of a difference, 
let us refer to the ever-recurring problem of sex differences in in¬ 
telligence, which is twofold: Is there a difference in mean perform¬ 
ance, and is there a difference in variation? Results obtained on 
the Stanford-Binet scale, for Berkeley children 6 years of age, are 
given in Table 9 (it is customary to attach to a measure its stand¬ 
ard error by the use of ±). 


Table 9. Sex Differences in IQ

                Boys              Girls
    N           795               779
    M           104.30 ± .49      105.16 ± .51
    $\sigma$    13.93 ± .35       14.29 ± .36

    $D_M$ = .86 ± .71,  $D/\sigma_D$ = 1.21
    $D_\sigma$ = .36 ± .50,  $D/\sigma_D$ = .72


In order to ascertain whether the difference between means is
greater than would be expected on the basis of chance, we first
compute the standard error of each mean (.49 and .51), substitute
these values in formula (26a) to obtain the standard error of the
difference between the means, and then determine $D/\sigma_D$ or
.86/.71 = 1.21. From Table A we find that as great a difference
as this would occur about 23 times in 100 as a chance fluctuation
from zero; hence we conclude that these data do not tend to support
the notion that there is a sex difference at this age level.
The question regarding a possible sex difference in variation can
be answered by determining the standard error of the difference
between the two $\sigma$'s. The observed difference, .36, over its standard
error, $\sqrt{.35^2 + .36^2}$, is only .72, from which we cannot conclude
that there is a difference in dispersion.
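
The figures of Table 9 can be reproduced from the N's, means, and sigmas alone. The following Python sketch uses the independent-samples pattern (the sum of the two sampling variances under the radical) for both comparisons; the function names are illustrative.

```python
import math

def se_mean(sd, n):
    """Standard error of a mean, formula (18)."""
    return sd / math.sqrt(n)

def se_sigma(sd, n):
    """Standard error of a standard deviation, formula (20)."""
    return 0.707 * sd / math.sqrt(n)

def se_diff_independent(se1, se2):
    """Standard error of a difference for independent samples."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

# Figures from Table 9 (Stanford-Binet IQ, Berkeley children 6 years of age).
boys = {"n": 795, "mean": 104.30, "sd": 13.93}
girls = {"n": 779, "mean": 105.16, "sd": 14.29}

# Difference between means and its critical ratio.
d_m = girls["mean"] - boys["mean"]
se_d_m = se_diff_independent(se_mean(boys["sd"], boys["n"]),
                             se_mean(girls["sd"], girls["n"]))
cr_means = d_m / se_d_m

# Difference between standard deviations and its critical ratio.
d_sigma = girls["sd"] - boys["sd"]
se_d_sigma = se_diff_independent(se_sigma(boys["sd"], boys["n"]),
                                 se_sigma(girls["sd"], girls["n"]))
cr_sigmas = d_sigma / se_d_sigma

print(round(se_d_m, 2), round(cr_means, 2), round(se_d_sigma, 2), round(cr_sigmas, 2))
```

The standard error of the difference between the sigmas comes out as .707 times that of the difference between the means, as expected for independent samples.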

Difference between correlated means. As another example, 
let us consider the first and second trial scores on a pursuit pendu¬ 
lum, given in Table 10. The mechanical scoring device was such 
that a decrease in score represents a gain. Can these data be 
taken as evidence that the second trial scores show a real, non¬ 
chance, improvement over the first trial scores? We can answer 
this question by evaluating the difference between the mean 
scores for the two trials. The means are 16.25 and 12.97, and 
the standard deviations are 5.94 and 5.75, respectively, for the 
first and second trials. In calculating the standard error of the
difference between means, i.e., the $\sigma_{D_M}$ for $D_M = 16.25 - 12.97$,
or 3.28, we should use formula (26) since the means are based on
the same individuals, which fact must be allowed for by taking
into account the correlation between the sets of scores. Thus
we need to know $r$, but a standard error equivalent to that given
by (26) can be obtained without determining $r$. This equivalent
can be developed by simple algebra.

Table 10. Scores on a Pursuit Pendulum

    1st      2nd      D or         1st      2nd      D or
    Trial    Trial    Gain         Trial    Trial    Gain

     9        4        5           17       13        4
    10        5        5           12        9        3
    20       14        6           19       15        4
    18       10        8           12        9        3
    12       12        0           25       25        0
    19       14        5           18       15        3
     7        6        1           21       15        6
     8       12       -4            8        7        1
    31       29        2           13       15       -2
    21       10       11           19       13        6
    16       13        3           13       13        0
    15       13        2           14       11        3
    23       24       -1           24       20        4
    12        6        6           16       10        6
    13       14       -1           13       10        3
    30       22        8           12        7        5

Since this alternate to formula (26) is appropriate to situations
other than that illustrated by first versus second trial scores, we
will first attempt to indicate other specific situations where either
method is applicable. Let $X_1$ stand for a score which is somehow
paired with another score that we designate as $X_2$, and let $N$ be
the number of such pairs. Some of the pairing possibilities of
$X_1$ and $X_2$ as scores are as follows:

a. $X_1$ as first trial, then practice, then $X_2$ as later trial; same person.

b. $X_1$ as initial, then experience, then $X_2$ as final; same person.

c. $X_1$ as pretest, then experience, then $X_2$ as posttest; same person.

d. $X_1$ under experimental conditions vs. $X_2$ under normal (or
control); same person.

e. $X_1$ in one experimental condition vs. $X_2$ in another; same
person.

f. $X_1$ as experimental vs. $X_2$ as control; twin or sib pair.

g. $X_1$ as experimental vs. $X_2$ as control; unrelated persons, but
matched.

For the last-mentioned situation, which is commonly employed
in experimental work, one can think of having drawn $N$ individuals
at random for an experimental group, and then forming
the control group by selecting individuals who can be matched
with the experimental cases on the basis of variables which need
to be controlled; thus any found difference between $M_1$ and $M_2$
will not be attributable to differences between the two groups in
respect to the variables used in the matching. This same matching
procedure, also twin or sib pairs, can be used for situation e.
Furthermore, the $X_1$ and $X_2$ scores can themselves stand for
changes: $X_1$ the change from pretest to posttest under experimental
conditions and $X_2$ the change under another experimental
condition or under control conditions.

In general, we wish to test the significance of the difference
between the mean of the $X_1$ scores and the mean of the $X_2$ scores.
Obviously, for some problems such a difference will be a change
(gain or loss); for others it will be either the difference between
the performances of two groups or the difference between changes
shown by two groups.

Sampling error via difference scores. The change for a given
person or the difference between a pair can be expressed as
$D = X_1 - X_2$ or $D = X_2 - X_1$, i.e., the subtraction can be
made in either direction; which direction will depend upon the
logic of the setup, but it must be the same for all $N$ pairs of values;
usually some $D$'s will be positive, some negative. Suppose for
convenience we take $D = X_2 - X_1$. By definition of the mean,
we have $M_D = \Sigma D/N$ as the mean difference or mean change.
Substituting for $D$, we have

$$M_D = \frac{\Sigma(X_2 - X_1)}{N} = \frac{\Sigma X_2 - \Sigma X_1}{N} = \frac{\Sigma X_2}{N} - \frac{\Sigma X_1}{N}$$

hence

$$M_D = M_2 - M_1 = D_M$$

by which we see that the mean of the differences is equal to the difference
between the means.

Let us next consider the standard deviation of the distribution
of differences, i.e., of the $(X_2 - X_1)$'s. We first express the $D$'s
as deviations from their own mean, i.e., $d = D - M_D$. Now
$D = X_2 - X_1$ and $M_D = M_2 - M_1$, so that

$$d = (X_2 - X_1) - (M_2 - M_1)$$

which, when the parentheses are removed and the terms shifted,
becomes

$$d = X_2 - M_2 - X_1 + M_1$$

or

$$d = (X_2 - M_2) - (X_1 - M_1)$$

Both these new parentheses terms define deviation units of the
$x = X - M$ type, so that

$$d = x_2 - x_1$$





The standard deviation squared, or variance, of the differences
can be expressed by substituting $d$ for $x$ in formula (5), thus

$$\sigma_D^2 = \frac{\Sigma d^2}{N}$$

If we replace $d$ by its equivalent, we have

$$\sigma_D^2 = \frac{\Sigma(x_2 - x_1)^2}{N} = \frac{\Sigma x_2^2}{N} + \frac{\Sigma x_1^2}{N} - \frac{2\Sigma x_2 x_1}{N}$$

The first two terms are obviously the variances for the second
and the first set of scores. The last term, involving the sum of
the cross products of $x_2$ and the $x_1$ with which it is paired, has to
do with the degree of correlation between, or similarity of, the
scores which belong to the same individual or to the two individuals
of a pair. The reader is asked to take on faith, without
further explanation here, the fact that the last term becomes
$2r_{12}\sigma_1\sigma_2$; hence we can write

$$\sigma_D^2 = \sigma_1^2 + \sigma_2^2 - 2r_{12}\sigma_1\sigma_2$$

or

$$\sigma_D = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r_{12}\sigma_1\sigma_2}$$


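Since the standard error of any mean is given by dividing the
standard deviation by the square root of $N$, we can secure the
standard error of the mean difference by dividing $\sigma_D$ (here the
standard deviation of the $D$'s) by $\sqrt{N}$, i.e.,

$$\sigma_{M_D} = \frac{\sigma_D}{\sqrt{N}} = \frac{\sqrt{\sigma_1^2 + \sigma_2^2 - 2r_{12}\sigma_1\sigma_2}}{\sqrt{N}} = \sqrt{\frac{\sigma_1^2}{N} + \frac{\sigma_2^2}{N} - \frac{2r_{12}\sigma_1\sigma_2}{N}}$$

The first two terms under the last radical are the sampling variances
of the two means, and since $2r_{12}\sigma_1\sigma_2/N$ can be written as

$$2r_{12}\,\frac{\sigma_1}{\sqrt{N}}\,\frac{\sigma_2}{\sqrt{N}}$$

we have finally that

$$\sigma_{M_D} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2 - 2r_{12}\,\sigma_{M_1}\sigma_{M_2}}$$

or that $\sigma_{M_D} = \sigma_{D_M}$, i.e., that the standard error of the mean difference
(or change) is equal to the standard error of the difference between
the two means. This is entirely logical since $M_D = D_M$.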

Thus, by working with the difference between paired scores, 
we can obtain the standard error of the mean difference (= differ¬ 
ence between means) without computing r. Even after we have 
learned how to compute r, it matters not whether we compute 
the standard error of the difference between means of related 
samples by formula (26) or whether we compute its equivalent, 
the standard error of the mean of the differences, by dividing the 
standard deviation of the distribution of differences between 
paired scores by the square root of $N$.

For the data on the pursuit pendulum (Table 10), we compute
all the differences, 9 - 4 = 5, 10 - 5 = 5, etc., and find the
mean of the differences to be 3.28, and the standard deviation of
the differences or gains to be 3.15. Then 3.15 divided by the
square root of N (here N = 32) gives .56 as the standard error of the mean
difference. The ratio of the mean difference to its standard error
is 5.9, and this is exactly equivalent to the ratio of the difference
between the two trial means divided by the standard error of the
difference obtained by using formula (26) with the proper value
for $r$. Since a critical ratio of 5.9 is considerably larger than 2.58,
we feel safe in concluding that improvement did take place. If
we had used (26a), we would have had a $\sigma_D$ of 1.46 and a critical
ratio of only 2.25.
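
The whole computation for Table 10 can be retraced in Python. The sketch below computes the standard error of the mean gain from the difference scores, then again by the correlational route (the two sampling variances minus twice the product of the two standard errors times r), and finally by the independent-samples pattern for contrast. The helper names are ours, and standard deviations are taken with N in the divisor, as in the text.

```python
import math

# Table 10: first- and second-trial scores for the 32 subjects
# (left-hand columns first, then right-hand columns).
first = [9, 10, 20, 18, 12, 19, 7, 8, 31, 21, 16, 15, 23, 12, 13, 30,
         17, 12, 19, 12, 25, 18, 21, 8, 13, 19, 13, 14, 24, 16, 13, 12]
second = [4, 5, 14, 10, 12, 14, 6, 12, 29, 10, 13, 13, 24, 6, 14, 22,
          13, 9, 15, 9, 25, 15, 15, 7, 15, 13, 13, 11, 20, 10, 10, 7]
n = len(first)

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """Standard deviation with N (not N - 1) in the divisor."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Route 1: work directly with the difference scores (a drop in score is a gain).
gains = [a - b for a, b in zip(first, second)]
mean_gain = mean(gains)
sd_gain = sd(gains)
se_mean_gain = sd_gain / math.sqrt(n)
cr_paired = mean_gain / se_mean_gain

# Route 2: the correlational formula, with r computed from the paired scores.
m1, m2, s1, s2 = mean(first), mean(second), sd(first), sd(second)
r = sum((a - m1) * (b - m2) for a, b in zip(first, second)) / (n * s1 * s2)
se1, se2 = s1 / math.sqrt(n), s2 / math.sqrt(n)
se_related = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)

# For contrast: treating the two trials as if they were independent samples.
se_independent = math.sqrt(se1 ** 2 + se2 ** 2)
cr_independent = (m1 - m2) / se_independent

print(round(mean_gain, 2), round(sd_gain, 2), round(se_mean_gain, 2), round(cr_paired, 1))
print(round(r, 2), round(se_related, 2), round(se_independent, 2))
```

The two routes give the same standard error apart from floating-point rounding, and ignoring the pairing roughly halves the critical ratio or worse; the independent-samples ratio differs from the 2.25 of the text only because the text works from rounded figures.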

Difference between independent proportions. Because proportions
or percentages are frequently the basis for describing and
comparing groups with regard to qualitative characteristics, sometimes
called attributes, the sampling treatment thereof merits
more discussion than the mere giving of a formula for the standard
error of a single percentage. Our treatment, however, will be in
terms of proportions; any reader who prefers percentages may
simply shift decimal points two places to the right. We are concerned
here only with that type of proportion which results when
a subfrequency is divided by a total frequency. Thus $p$ stands
for the proportion of individuals in a sample who exhibit a given
characteristic, and $q$ stands for the proportion not showing the
characteristic; $p + q = 1$, always. In some cases, this may be a
straight dichotomy, as yes or no, male or female, passing a test
item as opposed to failing, like versus dislike, delinquent or nondelinquent,
etc. In other cases, the dichotomy may be imposed
by placing one type of response against a number of others which
are grouped together as nonfirst type. Thus $p$ might be the proportion
of a group who are in the only-child category, while $q$
represents all other possibilities.

The sampling error of p is the same as that of q, as is evident
from the error formula, $\sqrt{pq/N}$, which, it will be recalled, is an
approximation in that the exact value depends upon the unknown
p or q of the universe. For the comparison of independent samples,
the standard error of the difference between two proportions,
$p_1$ and $p_2$, follows the usual pattern:

$$\sigma_{D_p} = \sqrt{\sigma_{p_1}^2 + \sigma_{p_2}^2}$$

or the equivalent

$$\sigma_{D_p} = \sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}} \qquad (27)$$

This form is defensible when both $N_1$ and $N_2$ are fairly large,
but when the N's are less than, say, 100, and particularly when
either or both of the sample proportions are extreme (smaller
than .10 or larger than .90), a better procedure is to base the
sampling error on the proportion of the two groups combined
who show the given characteristic. This is entirely consistent
with the null hypothesis, which assumes no difference between
$p_1$ and $p_2$. If there is no difference between the two universe
values, the variability of the random sampling distribution of the
difference between proportions can be best estimated by

$$\sigma_{D_p} = \sqrt{pq\left(\frac{1}{N_1} + \frac{1}{N_2}\right)} \qquad (27a)$$

in which $p_1$ and $p_2$ of formula (27) are replaced by p, the proportion
for the two groups combined. Since p as here defined is
based on $N_1 + N_2$ cases, it will in general be a better or more
stable estimate of the unknown p than will $p_1$ or $p_2$; hence (27a)
is a better approximation than (27) and is therefore to be preferred.
The difference in the results obtained by (27) and (27a)
is appreciable when the sample sizes differ, say, when one N is
two or three times as large as the other, and when one of the
samples yields a proportion as extreme as .98 or .99 (or .02 or .01).
Form (27a) helps overcome the obvious difficulty of formula (27)
which arises when the proportion for one sample is 1.00 or .00,
leading to one zero term under the radical. This implies that
the given proportion of 1.00 or .00 is not subject to sampling
error; such would be the case only if the universe proportion were
1.00 or .00.
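
The contrast between the two forms can be sketched in Python. The function names and the sample figures below are hypothetical, chosen only to show what happens when one sample proportion is 1.00 and the N's are unequal; the first function is taken to correspond to formula (27) and the second to formula (27a).

```python
import math


def se_diff_separate(p1, n1, p2, n2):
    """Formula (27): each sample supplies its own pq/N term."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def se_diff_pooled(p1, n1, p2, n2):
    """Formula (27a): one p, from the two groups combined, replaces p1 and p2."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


# Hypothetical samples: 40 cases with p = 1.00 and 120 cases with p = .90.
print(se_diff_separate(1.00, 40, 0.90, 120))  # first term under the radical is zero
print(se_diff_pooled(1.00, 40, 0.90, 120))    # larger, since no term vanishes
```

In this invented case the pooled form gives a standard error well above that of the separate form, because the sample with p = 1.00 no longer contributes a zero term.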

The use of p in lieu of $p_1$ and $p_2$ when testing the significance of
the difference between $p_1$ and $p_2$ does not overcome the difficulty
of skewed sampling distributions for extreme proportions. Since
the extent of skewness depends in part upon the extremeness of
the population proportion and in part upon the size of a given
sample, a criterion for ascertaining when it is unsafe to use the
standard error of the difference between proportions should
logically take both these factors into account. Let q stand for
the proportion of the combined group who fall in the smaller of
the two categories; i.e., q is assigned to the smaller of the two
proportions (or q is less than p), and let $N_s$ stand for the size of
the smaller of the two samples; then a rule-of-thumb criterion is
that $qN_s$ should be equal to or greater than 5 before it is safe to
compare the groups by the $D/\sigma_D$ technique.
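
As a rough illustration, the rule of thumb can be written as a small Python check. The function name, its arguments, and the counts in the examples are hypothetical; only the threshold of 5 comes from the text.

```python
def critical_ratio_is_safe(count_1, n1, count_2, n2, minimum=5):
    """Apply the rule of thumb: q times the smaller N must be at least `minimum`.

    count_1 and count_2 are the numbers showing the characteristic in each sample.
    """
    p = (count_1 + count_2) / (n1 + n2)  # proportion for the two groups combined
    q = min(p, 1 - p)                    # proportion in the smaller category
    return q * min(n1, n2) >= minimum


print(critical_ratio_is_safe(2, 40, 10, 120))   # q = .075, smaller N = 40: product 3
print(critical_ratio_is_safe(20, 40, 50, 120))  # q = .4375, smaller N = 40: product 17.5
```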

Correlated proportions. Formulas (27) and (27a) are applicable
only when the two samples have been drawn independently
of each other. If the individuals in one group were somehow paired,
as sibs or by matching, with the individuals in the second group,
a subtractive correlation term would be needed in the formulas.
Such a correlation term is also required in connection with the
commonly used setup in which a group of individuals give yes or
no responses and then, following an interpolated experience such
as a movie or lecture, give a second response. The purpose, of
course, is to determine whether a shift in responses has taken place
as a result of the interpolated experience. If $p_1$ is the proportion
of yeses given in the pretest and $p_2$ the proportion given in the
posttest, we need to evaluate the significance of the difference,
$D = p_1 - p_2$, in order to learn whether or not the difference is
attributable to chance. A well-controlled investigation would
obviously require a control group who also give two sets of responses
but do not have the interpolated experience. If $D_E$ represented
the shift for the experimental group and $D_C$ the shift for
the controls, the net difference between the two shifts would need
to be tested for statistical significance. In evaluating one change
or in evaluating the difference between changes, we need a formula
for the standard error of the difference between proportions which
makes allowance for the fact that the pretest and posttest proportions
are based on the same individuals. Both the standard error
of the difference for the experimental group and that for the control
group must be obtained by a formula which includes a correlational
term.

Parenthetically, it may be said that the setup which involves
an experimental and control group for studying shifts has led to a
great deal of confusion as to the proper statistical handling of the
data. We have a total of four proportions, a pretest and a posttest
for each of the two groups. By using a combination of subscripts,
1 and 2 for the pretest and posttest, and E and C to represent
the two groups, we can specify the proportions as $p_{1E}$, $p_{2E}$,
$p_{1C}$, and $p_{2C}$. Not all the possible differences between these four
will have meaning. Those that have meaning may be set forth as:

$D_E = p_{1E} - p_{2E}$, the change shown by the experimental group.

$D_C = p_{1C} - p_{2C}$, the change shown by the control group.

$D_1 = p_{1E} - p_{1C}$, the pretest difference between experimentals
and controls.

$D_2 = p_{2E} - p_{2C}$, the posttest difference between experimentals
and controls.

Which of these four meaningful differences should we test for
significance? Obviously, it is insufficient to test only $D_E$ because
we can't be sure that the shift shown, even though nonchance, is
really due to the interpolated experience. In fact, the reason for
the control group is to enable us to evaluate the shift which takes
place as a result of causes other than the experimentally provided
experience. Now it might be thought that, if $D_E$ is significant
while $D_C$ is less, or not at all, significant, an effect has been demonstrated.
This type of comparison, however, does not provide a
check on the net change. Some have argued that, if $D_2$ is significant
while $D_1$ is not, one can safely conclude that the interpolated
experience has had an effect. This comparison also fails to test the
net change. We should test the significance of the difference
between the two changes, i.e., $D = D_E - D_C$, in order to gauge
properly the net shift. Although, as regards absolute magnitude,
$D_E - D_C$ will always equal $D_2 - D_1$, it is easier to evaluate the
former difference. The standard error of $D$ ($= D_E - D_C$) can
be ascertained as soon as we have the proper formula for getting
the standard error of $D_E$, the difference between proportions based
on the same individuals. The formula for the standard error of
$D_E$ will obviously be applicable for determining the sampling error
of $D_C$. The standard error for the difference, $D_E - D_C$, will
follow the usual pattern:

$$\sigma_{D_D} = \sqrt{\sigma_{D_E}^2 + \sigma_{D_C}^2}$$

which does not involve an r term because it is assumed that the
experimental and control groups have been drawn independently.
(The discerning student will have noted that the arguments of
this paragraph will also hold when means instead of proportions
are available for experimental and control groups.)

We now return to the problem of the standard error of the difference
between two proportions based on the same sample, i.e.,
$D = p_1 - p_2$. Before we give the formula, the type of tabulation
needed should be considered. It would seem easy enough
simply to count the pretest yeses and divide by N to get $p_1$, and
then to count the posttest yeses and divide by N to secure $p_2$,
but in order to get the standard error of the difference we need a
four-way tabulation. The schema is set forth in Table 11. If
an individual gave a yes response the first time and a yes response
the second time, a tally mark would be made in the upper right-hand
cell; a yes at first, followed by a no, would go in the upper
left quadrant, and so on. Let A, B, C, and D represent the respective
frequencies for yes-no, yes-yes, no-no, and no-yes responses.
Then A + B is the total number of yeses on the first or pretest,
and B + D is the total number of yeses on the second or posttest.


Table 11. Tabulation Plan for Handling Proportions Based on the
Same Individuals

               Frequencies                     Proportions

                   2nd                             2nd
                No      Yes                     No      Yes
1st   Yes       A       B      A + B   1st Yes  a       b      p1
      No        C       D      C + D       No   c       d      q1
              A + C   B + D      N              q2      p2     1.0






Next, let's divide each of the frequencies in the quadrants and
on the margins by N. This will give the table of proportionate
frequencies to the right, with a, b, c, and d as the cell proportions
and p's and q's along the margins. Thus $p_1$ equals the proportion
of pretest yeses, and $p_2$ the proportion of posttest yeses. A standard
error of the difference which is equivalent to formula (27), in
that it involves $p_1$, $q_1$, $p_2$, and $q_2$, but differs in that the needed
correlational term has been included, though not visibly so, is
given by

$$\sigma_{D_p} = \sqrt{\frac{(a + d) - (a - d)^2}{N}} \qquad (28)$$


which can be easily computed without knowledge of correlation.
A formula based on one proportion, p, in this case the average of
$p_1$ and $p_2$ (since each is based on the same N), and analogous to
formula (27a) and therefore consistent with the null hypothesis,
is the following

$$\sigma_{D_p} = \sqrt{\frac{a + d}{N}} \qquad (28a)$$

which also allows for the correlational term. Formulas (28) and
(28a) have a type of restriction, regarding extreme proportions,
which differs from that imposed on (27) and (27a). The rule-of-thumb
criterion of applicability is that A + D should be 10 or
greater. The reason for this will be discussed in Chapter 11.
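
A minimal Python sketch of the two correlated-proportion formulas follows. It assumes the cell layout described for Table 11 (A for yes-no, D for no-yes), with formula (28) read as the square root of [(a + d) - (a - d)^2]/N and formula (28a) as the square root of (a + d)/N; the function names and the example counts are hypothetical.

```python
import math


def se_correlated_28(A, D, N):
    """Formula (28), from the two cells that record a change of response."""
    a, d = A / N, D / N
    return math.sqrt(((a + d) - (a - d) ** 2) / N)


def se_correlated_28a(A, D, N):
    """Formula (28a), the form consistent with the null hypothesis."""
    a, d = A / N, D / N
    return math.sqrt((a + d) / N)


def changers_sufficient(A, D, minimum=10):
    """Rule of thumb from the text: A + D should be 10 or greater."""
    return A + D >= minimum


# Hypothetical tabulation: 12 yes-no cases and 30 no-yes cases among 200.
print(se_correlated_28(12, 30, 200))
print(se_correlated_28a(12, 30, 200))
print(changers_sufficient(12, 30))
```

Only the two off-diagonal cells enter either formula; individuals who give the same response twice affect the result solely through N.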

Regardless of which formula, (27), (27a), (28), or (28a), is used,
the difference between the two proportions divided by its standard
error can be interpreted as any other $D/\sigma_D$. Formula (28)
or (28a) must be employed when the two proportions being compared
are based on the same individuals; to use either (27) or
(27a) for such a situation is to use a standard error which is too
large, thus leading to an underestimate of the significance of the
changes in response. Formula (28) or preferably (28a) is the correct
one for determining $\sigma_{D_E}$ and $\sigma_{D_C}$ when comparing the difference
in changes for two groups.

We may illustrate the foregoing by some data collected by
Carl Hovland and his associates, while with the Research Branch
of the Army's Special Services, on the effectiveness of a radio talk
which aimed, in part, to overcome an apparent overoptimism of
soldiers concerning the length of the war with Japan. The individuals
in an experimental and a control group were asked to
indicate how long they thought the war would last. When the
answers were dichotomized as “more than a year” and “a year or
less,” it was found that the proportion in the “more than” category
was .635 for the controls and .658 for the experimental group.
These pretest results are to be compared with the respective proportions,
.669 and .824, for the posttest responses. Between the
two askings of the question, the experimental group heard the
radio talk. Now the shift shown by the controls can be taken as
an estimate of change which took place under “normal” conditions,
whereas that shown by the experimentals reflects this same
normal shift plus that ascribable to the radio talk.

In order to evaluate the net shift, i.e., the difference between
the two changes or $D_D = D_E - D_C = (.824 - .658)
- (.669 - .635) = .166 - .034 = .132$, we need the standard
errors of $D_E$ and $D_C$. These can be ascertained by using formula
(28a) provided the data are available for making the fourfold
tables. These data are given in Table 12, which contains frequencies
and proportions according to the pattern of Table 11.


Table 12. Fourfold Tables for Comparing Changes in Length-of-War
Estimates (+ for 1 Year or More and - for Less Than 1 Year)

CONTROL GROUP

                Frequencies                       Proportions
                 Posttest                          Posttest
                 -      +                          -       +
Pretest   +      8    107    115    Pretest   +   .044    .591    .635
          -     52     14     66              -   .287    .078    .365
                60    121    181                  .331    .669   1.000

EXPERIMENTAL GROUP

                Frequencies                       Proportions
                 Posttest                          Posttest
                 -      +                          -       +
Pretest   +      0    135    135    Pretest   +   .000    .658    .658
          -     36     34     70              -   .176    .166    .342
                36    169    205                  .176    .824   1.000






For $\sigma_{D_E}$ we have $\sqrt{(.166 + .000)/205}$ or .029, and for $\sigma_{D_C}$ we get
$\sqrt{(.078 + .044)/181}$ or .026. Then for the standard error of
the difference between the two differences (changes) we have
$\sqrt{(.029)^2 + (.026)^2}$ or .039; thence CR = .132/.039 or 3.38,
which is highly significant. If we had used formula (27) instead
of the proper formula (28a) for calculating the standard errors
of $D_E$ and $D_C$, we would have obtained a $\sigma_{D_D}$ of .066 and a CR of
only 2.0.
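
The computation can be retraced from the Table 12 frequencies with a short Python sketch. The function and variable names are illustrative, and formula (28a) is taken as the square root of (a + d)/N. Because the sketch carries full precision rather than the rounded intermediate values used in the text, its critical ratio for the correlated case comes out near 3.4 rather than exactly 3.38; the comparison figures for formula (27) round to the .066 and 2.0 quoted above.

```python
import math

# Table 12 frequencies in the order (A, B, C, D):
# A = + then -, B = + then +, C = - then -, D = - then +.
CONTROL = (8, 107, 52, 14)
EXPERIMENTAL = (0, 135, 36, 34)


def summarize(cells):
    """Return the posttest-minus-pretest shift and its two standard errors."""
    A, B, C, D = cells
    n = A + B + C + D
    p_pre, p_post = (A + B) / n, (B + D) / n
    se_28a = math.sqrt((A + D) / n / n)  # sqrt((a + d) / N) with a = A/N, d = D/N
    se_27 = math.sqrt(p_pre * (1 - p_pre) / n + p_post * (1 - p_post) / n)
    return p_post - p_pre, se_28a, se_27


shift_e, se_e, se_e_27 = summarize(EXPERIMENTAL)
shift_c, se_c, se_c_27 = summarize(CONTROL)

net_shift = shift_e - shift_c
cr_correlated = net_shift / math.hypot(se_e, se_c)         # using (28a)
cr_independent = net_shift / math.hypot(se_e_27, se_c_27)  # using (27), improperly

print(round(net_shift, 3), round(cr_correlated, 2), round(cr_independent, 2))
```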

It should be pointed out that formula (28a) is the correct one
to use in testing the significance of the difference between the
difficulties of two test items which have been administered to the
same group, and for testing the difference between the responses
to two questions which have been asked of the same individuals.
Ordinarily, the “scores” on two test items will show some correlation,
but whether the responses to two questions will be correlated
depends upon the nature of the questions. Formula (28a) is also
the proper one to use in connection with the setup in which we
wish to compare the responses of two groups that have been formed
by matching individuals. For this situation, the fourfold table
would show the number of pairs the members of which give the
same responses, and the number of pairs the members of which
give different responses.

CONFIDENCE LIMITS FOR DIFFERENCES 

Although the simplest rule-of-thumb procedure for checking
the significance of a difference is to ascertain $D/\sigma_D$, the problem
of significance, as previously indicated, can also be handled by
means of confidence limits. All we need to do is to take the obtained
difference $\pm 2.58\sigma_D$ in order to set limits at the level of
confidence represented by P = .99. If these limits overlap zero
on the scale of possible differences, significance at the .01 level
cannot be claimed. There are times when it is desirable not only
to know whether a difference is significant but also to specify
limits for the population difference. Such a step does not presume
that a significant difference exists. Even when a difference fails
to reach significance, the specification of confidence limits gives
one some idea of the possible difference between the population
values, and such information may help answer the nonstatistical
question of whether the population difference is apt to be large
enough to be of practical or scientific importance. This procedure
may be helpful, especially when large samples are involved, in
evaluating the consequences of accepting the null hypothesis when
the hypothesis is in reality false.

The use of confidence limits may also be particularly helpful
when we have obtained a difference which is highly significant.
Consider the case of a difference of 4.78 inches in mean height
between men and their sisters. Because of large N's and the
presence of brother-sister correlation, the standard error of the
difference is very small. Its value is about .07. When we compute
the $D/\sigma_D$ we have a critical ratio of 68. This would, if we
could evaluate it, yield a probability, for as large a difference by
chance, which would be so microscopically small that we could
not comprehend it. However, if we set confidence limits at, say,
the .99 level, we would have 4.78 ± 2.58(.07), or 4.60 and 4.96,
as limits for the population difference. This permits a down-to-earth
way for evaluating the obtained difference.
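
The limits just quoted can be reproduced with a few lines of Python. The function name is illustrative; the multiplier is taken from the normal curve by the standard library rather than fixed at the rounded value 2.58, which makes no difference at two decimal places here.

```python
from statistics import NormalDist


def confidence_limits(difference, standard_error, level=0.99):
    """Normal-curve confidence limits for a difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.58 for the .99 level
    return difference - z * standard_error, difference + z * standard_error


low, high = confidence_limits(4.78, 0.07)
print(round(low, 2), round(high, 2))

# The limits exclude zero, so the difference is significant at the .01 level.
print(low > 0 or high < 0)
```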


SIMPLER HYPOTHESES 

Frequently we may desire to determine the significance of the
deviation of a value from some a priori value. This is somewhat
simpler than testing the significance of a difference in that a $\sigma_D$
need not be calculated; all we need is the standard error of the
mean, proportion, or whatever descriptive constant we have.
For example, if we question whether a given degree of skewness
differs significantly from zero, all we need to do is to regard the
obtained $g_1$ as a deviate from zero, and divide the deviate by $\sigma_{g_1}$.
Consider the $g_1$ of .028 for the distribution of 2970 Stanford-Binet
IQ's. Its standard error by formula (23) is .045 and the $x/\sigma$
becomes .028/.045 or .62, from which we conclude that the skewness
fails to reach significance. For the same distribution, the
kurtosis ($g_2$) is .346, which when divided by its standard error
obtained by formula (24) gives an $x/\sigma$ of 3.8. Since the probability
of obtaining as large a deviation, in either direction, is about
.0001, it can be said that the deviation of the kurtosis from normal
kurtosis is significant at the P = .0001 level.
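
These ratios can be retraced in Python. Formulas (23) and (24) are not reproduced in this section, so the sketch assumes the common large-sample forms, the square roots of 6/N and 24/N; that assumption is at least consistent with the .045 quoted above. The variable and function names are illustrative.

```python
import math
from statistics import NormalDist

N = 2970
g1, g2 = 0.028, 0.346

se_g1 = math.sqrt(6 / N)   # assumed form of formula (23)
se_g2 = math.sqrt(24 / N)  # assumed form of formula (24)

ratio_g1 = g1 / se_g1
ratio_g2 = g2 / se_g2


def two_tailed_probability(ratio):
    """Chance of a deviate at least this large, in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(ratio)))


print(round(se_g1, 3), round(ratio_g1, 2))
print(round(ratio_g2, 1), two_tailed_probability(ratio_g2))
```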

Simpler hypotheses involving percentages occur rather frequently.
We may wonder whether the percentage passing an
item deviates significantly from 50 per cent (medium difficulty),
or we may wish to know whether rats turn to the right (or left)
more frequently than expected on a chance basis (50-50), or we
may wish to determine whether an individual's choices differ
significantly from those expected on a chance basis (can an individual
successfully pick a particular brand from among four
brands?). In all such cases it is not only appropriate but also
proper to use the a priori or expected proportion in the formula
for the standard error of a proportion.
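
A brief Python sketch of such a test follows. The function name and both sets of counts are hypothetical; the point illustrated is only that the expected proportion, not the observed one, enters the standard error.

```python
import math


def deviation_ratio(successes, n, expected_p):
    """Deviation of an observed proportion from an a priori value, in SE units."""
    observed = successes / n
    standard_error = math.sqrt(expected_p * (1 - expected_p) / n)
    return (observed - expected_p) / standard_error


# Hypothetical: 60 right turns in 100 trials against a 50-50 expectation.
print(deviation_ratio(60, 100, 0.50))

# Hypothetical: 40 correct identifications in 100 trials with four brands,
# so that the chance expectation is .25.
print(deviation_ratio(40, 100, 0.25))
```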

REDUCTION OF SAMPLING ERRORS 

One of the aims of scientific method is to attain as great precision
in results as is practicable. In statistical work this can be accomplished
by increasing the accuracy or dependability of the responses
or scores or individual measurements and by decreasing
the chance sampling errors of the various statistical averages or
descriptive measures. One way to reduce sampling errors is to
employ the stratified sampling method which, since it is rather
complicated, cannot be discussed in this introductory chapter on
sampling. If the random sampling method is being used in projects
which aim to study the differences between groups, such as
city vs. rural, one city vs. another, one species vs. another, Republicans
vs. Democrats, the obvious way for decreasing the standard
error of the difference is to increase N for either or both samples.
In fact, if the random method is used, this is the only way of
increasing the precision of statistical measures and their differences.
Most field investigations are of this type.

In contrast, the experimentalist can define his population with
reference to two different laboratory or experimental situations,
i.e., a population of individuals under situation A and a population
of individuals under situation B; his sample individuals for
the two situations may be the same individuals, first under the
A and then under the B condition. In general, the use of the
same individuals, if feasible, will result in some degree of correlation,
the net effect of which is to reduce the standard error of the
difference; i.e., it is sometimes possible to reduce sampling errors
simply by using the same individuals in the “two” samples. Thus,
if we wish to study the effect of different types of ventilation on
mental output or efficiency, it will be a more economical and better
controlled experiment if we make observations on the same individuals
under the two conditions A and B, rather than on $N_1$
individuals under condition A, and $N_2$ individuals under condition B.

If it is not feasible to use the same individuals in the two experimental
situations, we can make up two groups by pairing individuals
on the basis of one or more characteristics, such as age,
sex, intelligence, socioeconomic background. Such a procedure
leads to more nearly comparable groups for our experiment than
can be obtained by choosing individuals at random, and by using
formula (26) instead of (26a) we can make allowance, by means
of r, for the fact that the individuals for the two samples have
not been chosen independently. For situations A and B, the use
of individuals who have been paired is considered good experimental
technique: it cannot be said that a found difference between
the means for the variable being studied may be due to a
lack of comparability of the two groups with respect to the variables
used in forming the pairs. The use of paired individuals has
a statistical as well as an experimental advantage in that the sampling
error of the difference between means is thereby reduced
without the necessity of increasing the number of cases. It is
sometimes possible by pairing individuals to produce an r high
enough to reduce the standard error of the difference by one-half;
to get such a reduction by increasing the size of our samples would
require quadrupling the number of cases. It should be noted
that, although we have been speaking of r as reducing the standard
error of the difference between means, careful pairing will also
reduce the standard error of the mean difference, since $\sigma_{M_D} = \sigma_{D_M}$
as computed by formula (26).
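
The remark about halving the standard error can be made concrete with a small Python sketch. It assumes, purely for illustration, that the two means have equal standard errors; under that assumption an r of .75 is what halves the standard error of the difference, the same gain that quadrupling N would give. The function name and the figures are hypothetical.

```python
import math


def se_difference(se_1, se_2, r=0.0):
    """Standard error of the difference between two means with correlation r."""
    return math.sqrt(se_1**2 + se_2**2 - 2 * r * se_1 * se_2)


unpaired = se_difference(1.0, 1.0, r=0.0)
paired = se_difference(1.0, 1.0, r=0.75)
print(paired / unpaired)  # ratio of paired to unpaired standard error

# Quadrupling N divides each standard error of a mean by 2 instead.
print(se_difference(0.5, 0.5, r=0.0) / unpaired)
```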

It is thus seen that, for some types of investigations, greater
precision can be obtained by judicious planning. If one had unlimited
resources, he could always secure any desired degree of
precision by simply taking sufficiently large samples. For example,
if one wished to specify a mean with an accuracy represented
by a $\sigma_M = .1$, he could first get an estimate of the required standard
deviation either from previous studies of the given variable
or by taking a sample of 100 cases. Suppose the latter yielded a
$\sigma$ of 10 points. Then one could determine the value of N which
would make $\sigma_M = \sigma/\sqrt{N} = .1$, or $10/\sqrt{N} = .1$. Squaring both
sides and solving for N will show that 10,000 cases are needed for
the desired accuracy. This is an approximation, since the $\sigma$ of 10
for the preliminary sampling is subject to error. One would have
greater assurance of attaining the needed precision if he took the
obtained $\sigma$ of 10 and added to it twice its own standard
error as a likely upper limit for the universe $\sigma$; i.e., $10 + 2\sigma_\sigma$, or
$10 + 2(10/\sqrt{200})$, which equals 11.41, would be used as the $\sigma$
in the relationship $\sigma/\sqrt{N} = .1$ for determining N. This leads
to an N of 13,018 in order to be reasonably sure that $\sigma_M$ would
be as small as .1. It might be smaller, since the value of the
standard deviation for 13,018 cases could easily be less than 11.41,
perhaps less than 10. As an exercise, the student can set forth
the steps required in deciding in advance the size of N needed to
attain a given degree of precision for a proportion or percentage.
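
The arithmetic of this paragraph can be sketched in Python. The function names are illustrative, and the standard error of a standard deviation is taken as the standard deviation divided by the square root of 2N, which reproduces the 11.41 of the text. Carried at full precision the cautious figure is slightly above the 13,018 obtained from the rounded 11.41.

```python
import math


def cases_needed(sigma, target_se):
    """Solve sigma / sqrt(N) = target_se for N."""
    return (sigma / target_se) ** 2


def cautious_sigma(sigma, pilot_n, multiple=2):
    """Pilot sigma plus `multiple` times its own standard error."""
    return sigma + multiple * sigma / math.sqrt(2 * pilot_n)


print(round(cases_needed(10, 0.1)))          # pilot sigma taken at face value
print(round(cautious_sigma(10, 100), 2))     # likely upper limit for sigma
print(math.floor(cases_needed(11.41, 0.1)))  # N from the rounded upper limit
```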

Frequently the question is raised as to how many cases should
be secured for a given study. The answer might be in terms of
the number needed to reach a given degree of accuracy, but this
in turn would only raise the question of what degree of precision
is needed. There is no ready-made answer to the latter question.
It may be of interest to note that, when group comparisons are
being made, and when the N's are relatively small, the null hypothesis
is apt to be accepted too often for the simple reason that a
real difference has to be sizable before it is demonstrable by small
samples. On the other hand, if a real difference is so small that
its statistical demonstration requires thousands of cases, one may
question whether it has practical or scientific importance. This
raises another issue: Does statistical significance always mean
scientific or practical significance? We leave this question for
the student to ponder.

A sample, as the term implies, should be representative, except
for chance, of the population from which it is drawn. The techniques
for obtaining a random representative sample vary from
discipline to discipline, and about the only general suggestion
that can be made is: carefully and critically analyze the proposed
method in order to determine whether selective factors could in
any way operate to prevent randomness. Occasionally, on the
basis of variables other than the one being studied, it is possible
to indicate the bias of our sample; i.e., we can conclude that a
sample is not representative, but whether a sample is really representative
is difficult to establish. A currently advocated, though
erroneous, check on representativeness is either (1) to take additional
samples, compute the mean, and if the means of these successive
samples fall within the limits of [...] from the original
sample mean, conclude that the original sample is representative;
or (2) to add more cases to the original sample and note the effect
on the mean; if it changes little, conclude that the original sample
is representative. The fallacy in so testing the representativeness
of a sample lies in the fact that the additional samples or additional
cases, if obtained by the same sampling method as the
original, will be subject to the same selective factors, and therefore
agreement with the first sample merely indicates that the
experimenter was successful in following the same, perhaps biased,
sampling procedure each time a sample was drawn. Consistency
is a necessary but not a sufficient condition for establishing the
trustworthiness of a result.


FINITE UNIVERSES 

A universe is said to be finite when there is a limited number of
individual units therein and infinite when the number is unlimited.
The standard error formulas which we have been considering are
based on the assumption that we are drawing samples from infinite
universes. If we are sampling from a finite universe, particularly
a universe with a rather small number of cases, it seems
reasonable to think that as the sample size becomes large relative
to the number of cases in the universe the sample mean, for example,
will tend to fluctuate less from the universe mean than is the
case when drawing from an infinite population. This suggests
that the standard error formulas need to be modified. The
required modifications are available for only a few statistical
measures. If we let N represent the sample size and $\mathcal{N}$ the size
of the finite universe, the standard errors for the mean and for a
proportion are as follows:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}\sqrt{1 - \frac{N}{\mathcal{N}}}, \qquad \sigma_p = \sqrt{\frac{pq}{N}}\sqrt{1 - \frac{N}{\mathcal{N}}}$$

In a given research it is sometimes difficult to decide whether
the universe being sampled is finite or infinite, and, if finite, it is
not always easy to determine the value of $\mathcal{N}$, the number of cases
in the universe. It might be argued
that psychologists never study an infinite universe. It can readily
be seen that the corrective factor in the sampling error formulas
becomes negligible as $\mathcal{N}$ becomes large. Thus, if $\mathcal{N}$ is known to
be large relative to N, it matters little whether the given universe
is wrongly conceived as being infinite. For example, when N is
.01 of $\mathcal{N}$, the term $N/\mathcal{N}$ in the above formulas leads to a reduction
in the sampling error of about .005 of the value obtained by the
ordinary formulas.
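
The size of the reduction can be checked with a short Python sketch. It assumes that the corrective factor is the square root of one minus the ratio of sample size to universe size, a form that agrees with the .005 just quoted; the function name and the illustrative sizes are hypothetical.

```python
import math


def finite_correction(sample_size, universe_size):
    """Factor by which the ordinary standard error is multiplied."""
    return math.sqrt(1 - sample_size / universe_size)


# Sample equal to .01 of the universe: reduction of about .005.
print(round(1 - finite_correction(100, 10_000), 3))

# Sample equal to half the universe: the reduction is no longer negligible.
print(round(1 - finite_correction(500, 1_000), 3))
```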

These formulas for the finite universe situation are frequently
useful when we wish to compare a subgroup with a total group
which contains the subgroup. Such a comparison is sometimes
erroneously made by taking $\sqrt{\sigma_t^2/N_t + \sigma_s^2/N_s}$ as the standard
error of the difference between the subgroup mean, $M_s$, and the
total mean, $M_t$. This makes no allowance for the fact that the
two means are not based on independent groups. An appropriate
procedure is to regard $M_s$ as based on a sample drawn from a
finite universe of $N_t$ cases with mean and standard deviation of
$M_t$ and $\sigma_t$; then with the standard error of $M_s$ taken as

$$\sigma_{M_s} = \frac{\sigma_t}{\sqrt{N_s}}\sqrt{1 - \frac{N_s}{N_t}}$$

we can test the significance of the deviation of $M_s$ from $M_t$ by
using the ratio $(M_s - M_t)/\sigma_{M_s}$, which is interpretable as a critical
ratio. This ratio will give a very close approximation to the
critical ratio which would be obtained if we were to compare the
subgroup with the remainder group (the total cases less the subgroup
cases) as two independent groups. The standard error of
the difference between the means for the subgroup and the remainder
group would be determined by the usual formula for
independent samples.

The argument of the preceding paragraph also holds for the 
comparison of a proportion or percentage for a subgroup with 
that for the total group. 

SAMPLING FROM SKEWED DISTRIBUTIONS 

Earlier in this chapter (pp. 51-52) it was stated that the sampling
distribution of the mean is normal when the samples have been
drawn from a normally distributed population or from moderately
skewed populations. The precise relationship between the degree
of skewness, $g_1$, for the trait or variable and the amount of skewness
for the sampling distribution of means is $g_M = g_1/\sqrt{N}$.
Thus the skewness of the means rapidly disappears as N is taken
larger and larger. For example, if $g_1$ is .77 (see Fig. 6) and N is
35, the skewness for the sampling distribution of means will be
only .13 (see Fig. 6).
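
How quickly the skewness of the means shrinks can be tabulated with a few lines of Python. The function name and the extra sample sizes are illustrative; the first line reproduces the example in the text.

```python
import math


def skewness_of_means(g1, n):
    """Skewness of the sampling distribution of means for samples of size n."""
    return g1 / math.sqrt(n)


print(round(skewness_of_means(0.77, 35), 2))

# The same trait skewness at a few other sample sizes.
for n in (10, 100, 1000):
    print(n, round(skewness_of_means(0.77, n), 3))
```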

NOTE ON THE PROBABLE ERROR 

A needless and antiquated procedure is the use of the probable
error instead of the standard error in connection with sampling.
The pe of the mean is $.6745\sigma_M$, and therefore we would expect
50 per cent of successive sample means to fall within $\pm pe_M$ of the
universe mean. Similarly, the pe for other statistical measures is .6745 times the
standard error. Since no additional information is yielded by
multiplying the standard error by a constant, the continuance of
this nuisance practice is being discouraged. The student who
attempts to survey the research literature on a given topic is apt
to encounter pe's, and he therefore must know the relationship of
the pe to the standard error.
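
For the reader who meets probable errors in older reports, the conversion can be sketched in Python. The constant is the normal deviate that cuts off the middle 50 per cent of the curve, obtained here from the standard library; the function names and the reported value of 1.2 are illustrative.

```python
from statistics import NormalDist

# Deviate bounding the middle 50 per cent of the normal curve (about .6745).
PE_FACTOR = NormalDist().inv_cdf(0.75)


def probable_error(standard_error):
    """Convert a standard error to a probable error."""
    return PE_FACTOR * standard_error


def standard_error_from_pe(pe):
    """Recover the standard error from a reported probable error."""
    return pe / PE_FACTOR


print(round(PE_FACTOR, 4))
print(round(standard_error_from_pe(1.2), 2))  # hypothetical reported pe of 1.2
```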



CHAPTER 6 


Correlation: Introduction and Computation 


One of the chief tasks of a science is the analysis of the interrelations
of the variables with which it deals. In the physical
sciences, and frequently in the biological sciences, the interrelations
can be determined by noting how much of a change in one
variable is associated with change in another. The physicist
studying the relationship between pressure exerted by a gas and
temperature can vary the latter at will so as to determine the
pressure at different temperatures. In the social sciences, and
sometimes in the biological sciences, the variables studied are apt
to be characteristics of individuals (plant or animal); thus to
study relationships the experimenter is compelled to make measurements
on several individuals. For example, if two variables
such as height and weight are under consideration, the measured
height and weight of N individuals will provide N pairs of observations
from which it can be determined whether the two vary
together. In either case it is important to determine the form
(mathematical) of the relationship and the accuracy with which
one can make predictions. The mathematical form of relationships
between biosocial variables is usually quite simple, but
prediction is nearly always subject to large and serious error.

Many of the relationships are expressible in terms of the simplest
of all mathematical forms, Y = A + BX, in which X and
Y represent variables and A and B are constants determinable
from the observations. The disturbing fact is the lack of accuracy
involved in predicting Y from X in individual cases. The accuracy
of prediction can, of course, be determined from the data, and it
is convenient that we have some general measure of this accuracy.
One such measure which can be computed and which will yield
information as to the degree of accuracy and the degree of relationship
is the correlation coefficient, designated r. This measure of
co-relation, as we shall soon see, not only tells us the degree of
relationship, but will also, in conjunction with the two means and
standard deviations, permit us to write the linear equation for
predicting Y from X or X from Y.

Our present discussion will be concerned with the determination 
of relationship between such typical variables as height, weight, 
strength, age, intelligence, social status, attitudes, i.e., with those
variables which show variation from individual to individual. 
The question of the relationship between variables of this type 
can be stated quite simply: Is there a tendency for the individual 
who ranks high (or low) on one characteristic to be high (or low) 
on another also? It has already been stated that in order to 
determine the relationship between two variables it is necessary 
that we have pairs of observations, i.e., two measures on each of 
several individuals. 

THE SCATTER DIAGRAM 

The first reducto-descriptive task is tabulation. If we have 
observations on the height and weight of a large number of individuals,
using cross-sectional or coordinate paper, we can lay off
on the y axis convenient tabulating intervals for, say, height and
on the x axis intervals for weight. The rules for choosing intervals
stated on p. 6 should be followed here. Tabulation then consists 
first of finding on the y axis the interval in which an individual's 
height falls and locating the interval on the x axis for his weight. 
A tally or dot is then placed in the cell formed by the intersection 
of these two intervals. The result of such a two-way or cross 
tabulation is referred to as a scatter diagram or correlation table. 
It will contain as many tallies as there are pairs of observations. 
The tallies in each row, or horizontal array, can be counted and 
recorded, separately by rows, to the right of the diagram. This 
procedure will, of course, yield the frequency distribution for all 
individuals with respect to the variable on the y axis. A similar 
count, and recording at the top, of tallies for each column, or 
vertical array, will yield the distribution for the other variable. 
The sum of the frequencies for either of these marginal distributions
should equal N, or the number of pairs of observations.
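
The tabulation just described can be sketched in a few lines. Everything here is illustrative: the function names, the interval starts and widths, and the eight (weight, height) pairs are invented for the example and do not come from any figure in this chapter.

```python
from collections import Counter

def cross_tabulate(pairs, x_start, x_width, y_start, y_width):
    """Tally (x, y) pairs into cells keyed by (row, column) interval numbers."""
    cells = Counter()
    for x, y in pairs:
        col = int((x - x_start) // x_width)  # interval on the x axis
        row = int((y - y_start) // y_width)  # interval on the y axis
        cells[(row, col)] += 1
    return cells

def marginals(cells):
    """Row totals (y distribution) and column totals (x distribution)."""
    row_totals, col_totals = Counter(), Counter()
    for (row, col), f in cells.items():
        row_totals[row] += f
        col_totals[col] += f
    return row_totals, col_totals

# Hypothetical (weight, height) pairs, invented for illustration only.
pairs = [(112, 61), (125, 64), (131, 66), (148, 68),
         (150, 67), (163, 70), (171, 72), (139, 65)]
cells = cross_tabulate(pairs, x_start=110, x_width=10, y_start=60, y_width=2)
rows, cols = marginals(cells)

# Each marginal distribution must sum to N, the number of pairs.
assert sum(rows.values()) == sum(cols.values()) == len(pairs)
```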





Figures 11 and 12 are representative scatter diagrams, but not 
models so far as number of grouping intervals is concerned. (In 
practice, from 12 to 20 intervals should be used in order to reduce 
the grouping error to a negligible amount.) The student should 
study these diagrams so as to grasp some of the mechanical details 
involved in their construction. It should be noted that the number
and size of the intervals for the two variables need not be the
same, and that the zero points on the scales of measurement need 
not appear or even be indicated on the axes. 

It can readily be seen that these two diagrams represent different
degrees of relationship. A precise method for measuring
or describing degree of relationship or association or correlation
will be discussed in detail in the pages to follow. We shall begin
with a symbolic definition of the Pearson correlation coefficient,
indicate its computation, and then discuss its meaning, interpretation,
assumptions, and finally its limitations. Certain elementary
mathematical derivations will be either indicated or given whenever
it is thought that their inclusion will be useful in clarifying
a point or clinching an assumption.

The Pearson product moment correlation coefficient is defined by

$$r = \frac{\sum xy}{N\sigma_x\sigma_y} \tag{29}$$


in which x and y represent deviation measures from the respective
means of the two variables, i.e., $x = X - M_x$ and $y = Y - M_y$;
the sigmas in the denominator are the standard deviations of the
two distributions, and N is the number of individuals measured.
With reference to a scatter diagram, $M_x$ and $\sigma_x$ hold for the marginal
distribution at the top, whereas $M_y$ and $\sigma_y$ hold for the distribution
to the right. The numerator term, $\sum xy$, implies that the
product of each individual's x and y is determined, and that all
such products are summed algebraically. There will, of course,
be N products in this sum, some of which will be positive, some
negative, and perhaps some zero.
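
The definition can be followed step by step in a short sketch. The function name `pearson_r` and the five pairs of scores are ours, for illustration; the standard deviations are computed with N in the denominator, as the definition requires.

```python
import math

def pearson_r(xs, ys):
    """Product moment r from deviation scores: sum(xy) / (N * sigma_x * sigma_y)."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    x = [v - m_x for v in xs]  # deviations from the mean of X
    y = [v - m_y for v in ys]  # deviations from the mean of Y
    sigma_x = math.sqrt(sum(v * v for v in x) / n)
    sigma_y = math.sqrt(sum(v * v for v in y) / n)
    sum_xy = sum(a * b for a, b in zip(x, y))  # algebraic sum of the N products
    return sum_xy / (n * sigma_x * sigma_y)

# Hypothetical scores: Y rises exactly linearly with X, so r should be 1.
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 6))  # 1.0
```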

It is economical to compute r in terms of deviations from an
arbitrary origin, instead of from the actual mean, for each variable.
It can be shown algebraically that formula (29) is very
nearly equivalent to the following:

$$r = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{N\sum d_x^2 - (\sum d_x)^2}\,\sqrt{N\sum d_y^2 - (\sum d_y)^2}} \tag{30}$$



[Fig. 11. Correlation scatter diagram for two tests. Horizontal axis: Iowa Reading Test.]

[Fig. 12. Correlation scatter for two forms of the same test.]











in which $d_x$ is defined as an individual's score deviation, in step
intervals, from an arbitrary origin on the X scale, and $d_y$ is defined
similarly for the Y scale. The student will note the similarity of
the radical terms to formula (6) for computing $\sigma$. Formula (30)
calls for two sums, two sums of squares, and a sum of cross products,
all in terms of step or interval deviations from arbitrary
origins. The arbitrary origins may be guesses at the means or 
may be taken at the bottom of each distribution. The former 
will involve handling smaller figures but will have the disadvantage 
of introducing negative numbers. The latter scheme is better if 
a calculating machine is available. 
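
That the choice of arbitrary origin does not affect the result can be seen in a small sketch of formula (30). The function name and the six pairs of scores are hypothetical; no grouping into intervals is done here, so the two origins give identical values.

```python
import math

def r_from_deviations(dxs, dys):
    """Formula (30): r from deviations about arbitrary origins."""
    n = len(dxs)
    sum_dx, sum_dy = sum(dxs), sum(dys)
    sum_dx2 = sum(d * d for d in dxs)
    sum_dy2 = sum(d * d for d in dys)
    sum_dxdy = sum(a * b for a, b in zip(dxs, dys))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx2 - sum_dx ** 2)
                   * math.sqrt(n * sum_dy2 - sum_dy ** 2))
    return numerator / denominator

# Hypothetical scores, invented for illustration.
xs = [52, 47, 61, 58, 49, 66]
ys = [104, 98, 121, 117, 95, 128]

# Origin at the bottom of each distribution: no negative numbers.
r_bottom = r_from_deviations([x - min(xs) for x in xs], [y - min(ys) for y in ys])
# Origin at a guessed mean: smaller figures, some of them negative.
r_guess = r_from_deviations([x - 55 for x in xs], [y - 110 for y in ys])

assert abs(r_bottom - r_guess) < 1e-12
```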


CALCULATION OF r

The computation of r will be illustrated for both hand and
machine calculating methods. The hand calculation scheme here
used may not be quite as economical as other available schemes,
but the particular setup has the advantage that it forms an economical
basis for machine computation, and the author presumes
that practically all those who are apt to compute more than a
few r's will have access to a calculating machine of the Monroe
or Marchant or Friden type. Once the steps involved in the hand
calculation form are grasped, it becomes easy to transfer them to
machine work. The writer has never found the commercial correlation
charts helpful. All one needs is a sheet of cross-section
paper ruled four lines to the inch, on which one can readily lay
out the axes, in intervals, for tabulating or tallying. When the
scatter diagram has been made and the tally (or dot) marks have
been summed across and up to get the marginal frequencies (as
shown in Figs. 11 and 12), the d values, taken from an arbitrary
origin at the bottom-most interval for each variable, can be written,
preferably with colored lead, alongside the marginal frequencies
(see Table 13). The columns of $fd$ and $fd^2$ values along
each margin can be obtained by multiplying in exactly the same
manner as was previously done for calculating the standard deviation.
The sums of these columns provide four of the five sums
needed for r.

In order to obtain $\sum d_x d_y$, each individual's $d_x$ must be multiplied
by his $d_y$, and all such products then summed. In the 140 interval
on the y axis we find one individual whose score on the X variable
falls in the 50 interval on the x axis. In terms of step deviations
his $d_y$ value is 8 and his $d_x$ value is 5, and therefore 5 times 8, or
40, represents his $d_x d_y$ product. Another individual with the
same $d_y$ value has a $d_x$ value of 6, whence 6 times 8 is his contribution
to $\sum d_x d_y$. The third individual in the 140 interval has a
$d_x$ value of 7, whence 7 times 8 is his product. These three individuals
contribute 5 × 8 + 6 × 8 + 7 × 8, or 144, to the sum
of products. The $d_y$ value of 8 is a common factor to these three
products, whence 8(5 + 6 + 7) or 8 × 18 yields 144. This suggests
a scheme, for computing the $d_x d_y$ sum, which involves first
summing the $d_x$ values for a particular Y interval or array and
then multiplying this sum by the $d_y$ value. Thus the $d_x$ values
of the individuals in the 130 interval sum to 34, and in the 120
interval to 34, and so on down to the 60 interval, which yields 2
as the sum of the $d_x$ values. The determination of these $d_x$ sums
is greatly facilitated by the use of a runner on which the $d_x$ values
0, 1, 2, 3, ..., have been labeled to correspond exactly with the
deviations in step intervals alongside the marginal distribution
at the top of the diagram. Since each of these $d_x$ sums is to be
multiplied by a $d_y$ value and then all the products summed, it is
convenient first to record the $d_x$ sums to the right as a separate
column and then to multiply each $d_x$ sum by the corresponding
$d_y$ value, thus leading to the last column of figures. Before these
final multiplications are made, the column of $d_x$ sums should be
added to see whether it agrees with the $\sum d_x$ already computed
from the marginal distribution of X scores. Thus an internal
check is provided for the column of $d_x$ sums; all other computations
should be done twice in order to insure accuracy.

Table 13. Computation of r

[The X interval labels are not legible in this copy; columns are identified by $d_x$, with $d_x = 5$ the 50 interval. Cell positions and the top marginal entries other than those for $d_x$ = 3, 4, 5 are reconstructed from the row sums and the totals 61, 224, and 1012. A dot marks an empty cell.]

    fdx²       0    5   40   99  208  225  288  147    (sum 1012)
    fdx        0    5   20   33   52   45   48   21    (sum 224)
    dx         0    1    2    3    4    5    6    7
    fx         2    5   10   11   13    9    8    3    (N = 61)

    Y                                                   fy   dy   fdy   fdy²   dx sums   dy(dx sums)
    140        .    .    .    .    .    1    1    1      3    8    24    192      18        144
    130        .    .    .    .    1    1    3    1      6    7    42    294      34        238
    120        .    .    .    1    2    2    1    1      7    6    42    252      34        204
    110        .    .    1    2    3    2    2    .     10    5    50    250      42        210
    100        .    .    2    3    4    2    1    .     12    4    48    192      45        180
     90        .    1    2    3    2    1    .    .      9    3    27     81      27         81
     80        .    2    3    2    1    .    .    .      8    2    16     32      18         36
     70        1    2    1    .    .    .    .    .      4    1     4      4       4          4
     60        1    .    1    .    .    .    .    .      2    0     0      0       2          0
    Sums                                                61        253   1297     224       1097

$$r = \frac{(61)(1097) - (224)(253)}{\sqrt{(61)(1012) - (224)^2}\,\sqrt{(61)(1297) - (253)^2}} = .776$$
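
The row-sum scheme can be sketched as follows. The cell frequencies are those of Table 13, but their column positions are an assumption on our part: they were inferred so as to reproduce the printed row sums and the totals 61, 224, and 1012. The variable names are ours.

```python
import math

# Cell frequencies keyed by d_y (8 = the 140 interval), then by d_x.
# Column positions are inferred from the printed sums (an assumption).
ROWS = {
    8: {5: 1, 6: 1, 7: 1},
    7: {4: 1, 5: 1, 6: 3, 7: 1},
    6: {3: 1, 4: 2, 5: 2, 6: 1, 7: 1},
    5: {2: 1, 3: 2, 4: 3, 5: 2, 6: 2},
    4: {2: 2, 3: 3, 4: 4, 5: 2, 6: 1},
    3: {1: 1, 2: 2, 3: 3, 4: 2, 5: 1},
    2: {1: 2, 2: 3, 3: 2, 4: 1},
    1: {0: 1, 1: 2, 2: 1},
    0: {0: 1, 2: 1},
}

n = sum(sum(cells.values()) for cells in ROWS.values())
sum_dy = sum(dy * sum(cells.values()) for dy, cells in ROWS.items())
sum_dy2 = sum(dy * dy * sum(cells.values()) for dy, cells in ROWS.items())

# Step 1: sum the d_x values within each Y array (the "dx sums" column).
dx_sums = {dy: sum(dx * f for dx, f in cells.items()) for dy, cells in ROWS.items()}
# Internal check: the column of d_x sums must total the marginal sum of d_x.
sum_dx = sum(dx_sums.values())
sum_dx2 = sum(dx * dx * f for cells in ROWS.values() for dx, f in cells.items())
# Step 2: multiply each d_x sum by its d_y and accumulate.
sum_dxdy = sum(dy * s for dy, s in dx_sums.items())

r = (n * sum_dxdy - sum_dx * sum_dy) / (
    math.sqrt(n * sum_dx2 - sum_dx ** 2) * math.sqrt(n * sum_dy2 - sum_dy ** 2)
)
print(n, sum_dx, sum_dy, sum_dx2, sum_dy2, sum_dxdy)  # 61 224 253 1012 1297 1097
```

Carried to more decimals, r from these five sums comes to about .775, which agrees with the printed .776 to within one unit in the third decimal.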

When a calculator is available, the work sheet need not include
the $fd$ and $fd^2$ columns, since the sums of these two columns can
readily be obtained by the method discussed on pp. 23-24. This
means that the column of $d_x$ sums can be placed alongside the $d_y$
values; then each $d_x$ sum can be multiplied by the juxtaposed $d_y$
value, with the products allowed to accumulate in the dial as the
needed $\sum d_x d_y$. Thus the right-hand column of figures need not
appear on the work sheet.

The substitution of the five sums into formula (30) is straightforward.
The denominator factors are evaluated as explained
on p. 24, and the numerator is obtained by punching $\sum d_x d_y$ into
the keyboard and multiplying by N; then, with the product left
in the lower dial, $\sum d_x$ is subtracted $\sum d_y$ times. If needed, the two
means can be obtained by substituting $\sum d_x$ and $\sum d_y$ into (3a),
and the two standard deviations by multiplying the proper radical
by the interval size and dividing by N [equivalent to substituting
the sum and sum of squares into (6)].

The correlation coefficient may be computed without plotting
a scattergram by substituting gross score values into formula
(30a):

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \tag{30a}$$

RANK-DIFFERENCE r 


An easily computed approximation for r can be obtained by 
the rank-difference method. The individuals are assigned their 
rank order positions with respect to each of the two variables, 
and for each individual the difference between his two ranks is 
determined. Then the N differences are squared and summed, 
and this sum is substituted into 


$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} \tag{31}$$


This will yield $\rho$ (rho) or the rank-difference measure of correlation.
We shall illustrate the computation of $\rho$ by using the 13 pairs of
scores in Table 14. It is customary, though unnecessary, to
assign the rank of 1 to the best individual. In a tie, the ranks are
split between the individuals who are tied. Thus (in Table 14)
the two individuals who scored 132 on the second trial share the
eighth and ninth ranks, and accordingly each is ranked as 8.5.
If there had been a third score of 132, we would have split ranks
8, 9, and 10, assigning each of the three a rank of 9. All the computation
shown in Table 14 will not appear on the work sheet if
the d's are squared on a calculating machine and allowed to cumulate.

Table 14. Calculation of Rank-difference Correlation

         Scores                Ranks            Differences
    1st Trial  2nd Trial     1st     2nd         d        d²
        88        40           3       1         2        4
        95        46           4       2         2        4
       202       135          10      10         0        0
       176        98           8       4.5       3.5     12.25
       118       115           5       6         1        1
       186       137           9      11         2        4
        74        63           1       3         2        4
        78       117           2       7         5       25
       306       294          13      13         0        0
       211        98          11       4.5       6.5     42.25
       158       132           7       8.5       1.5      2.25
       151       132           6       8.5       2.5      6.25
       230       237          12      12         0        0
                                                 Sum    105.00

$$\rho = 1 - \frac{6(105)}{13(169 - 1)} = 1 - \frac{630}{2184} = .71$$
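
The ranking and the substitution into formula (31) can be sketched as below, using the 13 pairs of scores from Table 14. The function names are ours. As in the table, rank 1 goes to the lowest score, and tied scores share the mean of the ranks they occupy.

```python
def average_ranks(scores):
    """Rank 1 = lowest score; tied scores split the ranks they occupy."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the tied rank positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def rank_difference_rho(first, second):
    """Formula (31): rho = 1 - 6 * sum(d^2) / (N * (N^2 - 1))."""
    n = len(first)
    r1, r2 = average_ranks(first), average_ranks(second)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

first = [88, 95, 202, 176, 118, 186, 74, 78, 306, 211, 158, 151, 230]
second = [40, 46, 135, 98, 115, 137, 63, 117, 294, 98, 132, 132, 237]
rho, sum_d2 = rank_difference_rho(first, second)
print(sum_d2, round(rho, 2))  # 105.0 0.71
```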

If the student attempts to rank, say, 50 or 100 individuals, he
will be convinced that it is time consuming; i.e., the determination
of $\rho$ may involve as much or more labor than the computation of
r. For some problems the assigning of ranks is unnecessary because
the observations are in terms of ranks instead of measures,
and in such cases the rank-difference method is a feasible technique
for measuring correlation. Rho is not as consistent from
sample to sample as r, does not possess the mathematical advantages
inherent in r, and therefore has merit only when the observations
involve ranks, i.e., are not measures.



CHAPTER 7

Correlation: Interpretations and Assumptions 


Intelligent use of the correlation coefficient and critical understanding
of its use by others are impossible without knowledge
of its properties. It is not sufficient that we be able merely to 
recognize r as a measure of relationship. It is a peculiar kind of 
measure which permits certain interpretations provided certain 
assumptions are tenable and provided one considers possible 
disturbing factors. Since the interpretations of r are so closely 
related to assumptions, no attempt will be made to present a 
separate discussion of these two aspects. The factors which affect 
r, and which are therefore limitations additional to assumptions, 
will be discussed in Chapter 8. 

STUDY OF SCATTERGRAM 

We shall begin by making a somewhat detailed study of certain 
properties of a typical scatter diagram. The columns and rows 
of the diagram have already been referred to as vertical and 
horizontal arrays, the intersection of two arrays has been called a 
cell, and the meaning of the marginal distributions has been given. 
If the scatter diagram depicted in Table 15 is examined, it will be 
noted that each vertical (and also each horizontal) array contains 
a frequency distribution, and that the marginal totals really 
represent the number of cases in these array distributions. These 
array distributions are very much like any other typical distribution:
bell-shaped with a clustering or scattering about a central
value. The mean and standard deviation again become useful 
descriptive terms. Thus, in Table 15, the mean height of sons 
whose fathers were 64 inches tall is found to be 66.8 inches. This 
is simply the mean of the 12 cases which fall in this particular 
array. Similarly for all the vertical arrays we have the means as
recorded along the bottom of Table 15. The means of the horizontal
array distributions have been recorded to the right of the
scatter diagram. For example, the mean height of the 10 fathers 
whose sons were 72 inches tall is 70.0 inches. 


Table 15. Correlation Table for Height of Fathers (X) and Height of Sons (Y)



If the means of the vertical arrays are plotted (see crosses in
Fig. 13) two things will be noticed: the means are progressively
greater as we pass from short to tall fathers, and they fall approximately
on a straight line. It will be noted (see dots in Fig. 13)
that the means for the horizontal arrays also approximate a line
and show progression. Now, with reference to the means of the
vertical arrays, each represents the mean height of sons of fathers
of a particular height and therefore may be used as a basis for
predicting the height, if unknown, of a man if we have been told
the height of his father. Thus, if the father is 66 inches tall, the
best estimate of his son's height is 67.6, the observed mean height
of men whose fathers are 66 inches in height.

[Fig. 13. Plot of array means for data of Table 15.]

Obviously such an estimation would be subject to considerable 
error, since we have also the observable fact that the heights of 
sons of fathers 66 inches tall show a large amount of variation 
about the array average. This variation tells us something about 
the possible magnitude of the error involved in using 67.6, the 
array mean, as our estimated value. The unknown height, of
which we take an array mean as an estimate, may actually fall
anywhere within rather wide limits on either side of the array
mean. These limits can be described in terms of the standard
deviation of the array distribution; i.e., the error of estimate can
be stated in terms of a $\sigma$. The standard deviation for the distribution
of heights of sons whose fathers were 66 inches in height is
about 2.1. Now, if we take 67.6 as the best estimate, we can
say that, if we were to predict the height of 100 sons (fathers
66 inches), about 68 per cent of the time the error would be within
the limits 67.6 ± 2.1, 95 per cent within 67.6 ± 4.2, and nearly
always within the limits 67.6 ± 6.3. Likewise, when the sigmas
for the several arrays have been computed, a statement of the
limits of the error in predicting any son's height from his father's
height can be made. Such a procedure will yield as many measures
of error as there are vertical arrays. We shall soon see that a 
convenient assumption can be made which will allow us to use a 
single indication of the error of estimate. 
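
The limits quoted for the 66-inch array are simply the array mean plus and minus one, two, and three array sigmas. A minimal sketch, with a function name of our own choosing:

```python
def error_limits(estimate, sigma, multiples=(1, 2, 3)):
    """Limits estimate +/- k*sigma for each multiple k, rounded to one decimal."""
    return {k: (round(estimate - k * sigma, 1), round(estimate + k * sigma, 1))
            for k in multiples}

# Array of sons whose fathers were 66 inches: mean 67.6, sigma about 2.1.
limits = error_limits(67.6, 2.1)
for k, (low, high) in limits.items():
    print(k, low, high)
```

Under the usual normal-curve reasoning, these three pairs of limits correspond to the "about 68 per cent," "95 per cent," and "nearly always" statements in the text.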

Let us return again to the line of the means. Two such lines 
have been drawn in Fig. 13; one line "fits" the means of the vertical,
the other the means of the horizontal, arrays. Let us for the
present confine our attention to the means of the vertical arrays.
They do not lie exactly on the drawn line; some are above, some
below. If they fell exactly on the line, a prediction based on an
array mean would be precisely the same as a prediction obtained
by noting the Y value of the line where it cuts the middle of the
array. Furthermore, if the means were exactly on a straight line,
we might write the equation for this line in the form Y = BX + A,
where A equals the y intercept (value of Y where line crosses the
y axis) and B equals the slope of the line (the inclination of the
line to the x axis). With A and B known, the value of Y for a 
particular X can be readily estimated. 

But, since the means do not lie exactly on a straight line, the 
above reasoning would not seem offhand to yield us anything of 
practical value. From many viewpoints, however, it is desirable 
that we determine the equation of the straight line which best
"fits" the means, i.e., the equation of a line which passes near all
the means. Then we can use this equation instead of the array 
means in making predictions. The justification for this procedure 
depends upon the validity or tenability of an assumption: we 
assume that the failure of the means to fall exactly on a straight
line is due to chance fluctuations in the means. Each array mean
is based on a sample and consequently deviates more or less from 
the true or population value of the mean for the array. This is 
equivalent to saying that, if all the array means were based on a 
much larger number of cases, we could assume that they would 
approximate more exactly a straight line. This is an assumption 
which can always be made provided the array means for a particular
scatter do not show marked deviations from linear form.
(Adequate checks in terms of probability, to be described later, 
can be utilized to ascertain whether the fluctuations from linearity 
are larger than is reasonable on the basis of chance.) 

THE BEST-FIT LINE 

We can now consider one of the advantages of using a line 
instead of the several array means as a basis for prediction. The 
location of the line is dependent upon all the means, or rather 
upon all the cases. It therefore seems reasonable to believe that 
the line would be more stable from the sampling viewpoint than 
would the array means, each of which is based on a rather small 
number of cases. 

If we accept the assumption of linearity of array means, our
problem is that of determining A and B so that we can write the
equation of the line of means. We need the equations of two
lines: Y = BX + A for the means of the vertical arrays and
X = B'Y + A' for the horizontal array means. We shall consider
the determination of the constants A and B for the first
equation, but before doing so something must be said concerning
what is meant by a "best-fit" line. The constant A gives the y
intercept, i.e., tells us where the line cuts the y axis. Suppose we
think of several possible lines having the same slope (the same
B) as the line in Fig. 13 which passes near the crosses. Obviously,
if we considered a line passing near the top or bottom of the scatter
diagram, it would be a "worse fit" than that drawn in Fig. 13.
Likewise, if we think of pivoting the line about some point, thereby 
altering its slope, it can be readily seen that rotating it to a vertical 
or horizontal position would give a worse fit. It should now be 
clear that the assigning of some values to A and B will lead to a 
worse fit than that obtained by certain other values, or conversely 
that some values will yield a much better fit than others. 




One criterion which is accepted as a basis for a best-fit line is 
that the sum of the squares of the deviations from the line shall 
be as small as possible. With respect to determining the best-fit 
line to the means of the vertical arrays, this criterion or definition 
of fit implies that the values of A and B are to be such that the 
sum of the squared deviations of the observed heights of sons
(deviations in an up and down or vertical direction) about the
line will be a minimum. Stated in symbols, let Y' = BX + A,
where Y' (read Y prime) is the value estimated from a given X,
and let Y be the observed value. Then $(Y - Y')^2$ represents
the squared deviation of any Y from the line or estimated value.
The problem is so to choose A and B as to make $\sum (Y - Y')^2$ as
small as possible. It is more convenient to deal with both the
equation, $y' = bx + a$, and the sum, $\sum (y - y')^2$, in deviation
units, with y' and y as deviations from $M_y$ and $x = X - M_x$.
This is merely the translation of the axes which makes the origin
or reference point coincide with $M_x$ and $M_y$. The student should
visualize the meaning of this shift of axes. Note that the pattern
of tallies is not changed by this simple transformation. Do you
think that the slope B will equal the slope b? Will A = a? Let
us keep the first question in abeyance and examine now the second
question. Both A and a represent the y intercepts of the
desired prediction line. If it is not immediately obvious to the 
student that A cannot equal a, he should imagine that in Table 15 
and Fig. 13 the axes have been moved so that the origin is at the 
center of the scatter diagram, and then ask himself where the 
line through the means of the vertical arrays would cut the new 
y axis. (Incidentally, it should be noted that the value of A 
cannot be read directly from Fig. 13 for the simple reason that the 
reference frame as drawn does not include the origin. The real 
y and x axes of the original measures would be, respectively, to 
the left of, and lower than, the indicated axes.) 

It is of interest to speculate concerning the value of a in the 
equation y' = bx + a. Common sense would suggest that, if an
individual were average on X, the best guess would be that he
would be average on Y. That is, if $X = M_x$, one would expect
Y' to equal $M_y$. But, if an individual's X measure fell at $M_x$, his
deviation, or x value, would be 0, and the estimated value of Y
as being equal to $M_y$ would in terms of deviation scores become 0.
This would imply that the prediction line would pass through
the origin of the deviation score reference axes, and consequently
that the y intercept would be zero; hence a = 0. For the purpose 
of simplifying the determination of the best value for b, we ask 
the reader to accept, on the basis of the above reasoning, that 
a = 0 for the best-fitting line. If we carried both a and b along 
in the following development, a would in fact turn out to be zero. 

This permits us to write y' = bx as the equation for estimating
y, in deviation units, from x, or deviation values of X. Our task
becomes that of determining the value of b which will make
$\sum (y - y')^2$ a minimum. Incidentally, it should be obvious that
the discrepancy of any particular y value from the desired line
has the same numerical value as the deviation of its corresponding
original Y value from the line, and that $\sum (y - y')^2 = \sum (Y - Y')^2$.
When we have determined the optimal value for b in y' = bx,
we can readily pass back to the original reference frames, the
gross score axes, by substituting for y' the value $Y' - M_y$, and
for x, $X - M_x$. With a fixed as zero, i.e., with the y intercept
equal to zero, we can think of the line as passing through the 
origin (deviation axes); i.e., its up and down location is fixed. 
Obviously, many lines could be drawn through the origin, and 
they would differ only as to slope, i.e., as to b. Of all possible 
lines which may be drawn through the origin, some will be closer 
than others to the observations (tallies) in toto. One might imagine 
several lines any of which would seem to constitute a good fit. 
As one takes lines with either greater or lesser slope than those of 
apparently good fit, the fits will become worse; and of those which 
seem to fit, some will actually be better than others. The student 
might think that it would only be necessary to draw what seems 
by inspection to be the best-fitting line, and then obtain its slope 
by actually measuring the angle which it makes with the horizontal 
(with needed adjustment to allow for the measurement units). 
The trouble with this procedure is that individuals would tend 
to disagree regarding which of several lines was really best; also, 
the measurement of angles would be none too exact. What we 
need is a procedure which is objective, a method that will yield 
the value of b which leads to the best possible fit in the sense of 
reducing the sum of the squares of the discrepancies to a minimum. 

We set up the function

$$f = \frac{\sum (y - y')^2}{N} = \frac{\sum (y - bx)^2}{N}$$

in which we have N deviations of the form $y - y'$ or $y - bx$ (since
y' = bx). These deviations when squared, summed, and divided
by N give us a quantity or function which is to be minimized by
the proper choice of b. The value to be assigned to b can best be
ascertained by the calculus.* This is done by taking the derivative
of the function with respect to b, setting this derivative equal
to zero, and then solving for b. Thus

$$\frac{df}{db} = \frac{-2\sum x(y - bx)}{N}$$

which, set equal to zero and divided by −2, gives

$$\frac{\sum x(y - bx)}{N} = 0$$

or

$$\frac{\sum xy - b\sum x^2}{N} = 0$$

then

$$\frac{\sum xy}{N} - b\,\frac{\sum x^2}{N} = 0$$

The first or cross-product term involves the correlation coefficient
as defined by formula (29), from which definition formula we see
that $\sum xy/N = r\sigma_x\sigma_y$; and since $\sum x^2/N = \sigma_x^2$, we have

$$r\sigma_x\sigma_y - b\sigma_x^2 = 0$$

or

$$r\sigma_y - b\sigma_x = 0$$

which gives

$$b = r\,\frac{\sigma_y}{\sigma_x}$$

as the optimal value for b. We therefore have

$$y' = r\,\frac{\sigma_y}{\sigma_x}\,x \tag{32}$$

as the equation for the best-fit line. This equation is in terms of
deviation measures, and by proper substitution we get

$$Y' - M_y = r\,\frac{\sigma_y}{\sigma_x}\,(X - M_x)$$

or

$$Y' = r\,\frac{\sigma_y}{\sigma_x}\,X + \left(M_y - r\,\frac{\sigma_y}{\sigma_x}\,M_x\right) \tag{32a}$$

as the equation in terms of the original or gross scores. This is
the form which we would use in predicting Y from X. Note that
$B = b = r(\sigma_y/\sigma_x)$ is the slope of this line and that the constant A
is equal to the parentheses term.

* The student who has not studied the calculus will either take the first
part of the following derivation on faith or, if skeptical, will dig into a calculus
text to satisfy himself that no magic is involved here.

By similar reasoning the equation of the best-fit line to the
means† of the horizontal arrays is found to be

$$x' = r\,\frac{\sigma_x}{\sigma_y}\,y \tag{33}$$

which becomes

$$X' = r\,\frac{\sigma_x}{\sigma_y}\,Y + \left(M_x - r\,\frac{\sigma_x}{\sigma_y}\,M_y\right) \tag{33a}$$

Regression. Equations (32) and (33) in deviation score form
and (32a) and (33a) in gross score form are known as regression
equations, and the constants denoting slope are known as regression
coefficients. It is assumed that prediction will be as accurate by
means of a regression equation as by way of array means, and it
can readily be seen that by using a regression equation one can
predict from intermediate values, e.g., 64½. This is of especial
advantage with grouped data: the array mean is associated only 
with the midpoint value of the grouping interval, whereas the 
regression line is not so limited since it is continuous. 
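
The student without the calculus may satisfy himself numerically that the slope $b = r\sigma_y/\sigma_x$ does minimize the mean squared discrepancy. The sketch below tries slopes a little above and below b and confirms that each gives a larger value of the function. The function names and the six pairs of heights are hypothetical; they are not the data of Table 15.

```python
import math

def deviation_scores(values):
    """Deviations of each value from the mean of the list."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def slope_from_r(xs, ys):
    """Slope b = r * sigma_y / sigma_x of the line y' = bx (formula 32)."""
    x, y = deviation_scores(xs), deviation_scores(ys)
    n = len(xs)
    sigma_x = math.sqrt(sum(v * v for v in x) / n)
    sigma_y = math.sqrt(sum(v * v for v in y) / n)
    r = sum(a * b for a, b in zip(x, y)) / (n * sigma_x * sigma_y)
    return r * sigma_y / sigma_x

def mean_squared_discrepancy(b, xs, ys):
    """The function to be minimized: sum((y - b*x)^2) / N in deviation units."""
    x, y = deviation_scores(xs), deviation_scores(ys)
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / len(xs)

# Hypothetical heights, invented for illustration only.
fathers = [64, 66, 67, 68, 70, 72]
sons = [66, 67, 69, 68, 70, 71]

b = slope_from_r(fathers, sons)
f_best = mean_squared_discrepancy(b, fathers, sons)
for trial in (b - 0.10, b - 0.01, b + 0.01, b + 0.10):
    # Every other slope gives a larger mean squared discrepancy.
    assert mean_squared_discrepancy(trial, fathers, sons) > f_best
```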

Rate of change. The results of the foregoing derivation make 
it clear that the correlation coefficient, along with the two means 
and the two standard deviations, enables us to write the equation 
by which either variable can be predicted from the other. The 
regression coefficients indicate the rate of change (unit of change
in one variable per unit of change in the other), and in case the
two standard deviations are equal, r itself indicates the rate of
change. Thus we have one of the possible interpretations of the
correlation coefficient.

† More strictly speaking, we are fitting a line to means weighted according
to their respective N's; i.e., we are fitting a line to the observations.

For the correlation table in Table 15 we get, by proper substitution,
the following as the regression equations:

Y' = .52X + 33.24 (to estimate son's height)

X' = .60Y + 26.63 (to estimate father's height)

The student should study Fig. 13 sufficiently to convince himself
that .52 is the slope of the line passing near the crosses, and that
.60 represents the slope (with reference to the vertical) of the line
through the dots. The student should also satisfy himself that
the constants 33.24 and 26.63 really represent the points at which
the two lines intercept the y and x axes. Finally he should show
that, if a father's height is at the mean of all fathers, the mean
of the heights of all the sons is the best estimate of his son's height.
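The two printed equations can be explored with a short Python sketch. It uses only the rounded coefficients given above; the means and the value of r that it recovers are therefore approximations derived from those coefficients, not figures quoted from Table 15, and the function names are assumptions made for illustration.

```python
import math

# Coefficients as printed for Table 15 (rounded to two decimals).
B_SON, A_SON = 0.52, 33.24         # Y' = .52X + 33.24
B_FATHER, A_FATHER = 0.60, 26.63   # X' = .60Y + 26.63


def estimate_son(father_height):
    return B_SON * father_height + A_SON


def estimate_father(son_height):
    return B_FATHER * son_height + A_FATHER


# Both regression lines pass through the point (M_x, M_y), so solving the
# two equations simultaneously recovers the means, to rounding accuracy.
mean_father = (B_FATHER * A_SON + A_FATHER) / (1 - B_FATHER * B_SON)
mean_son = estimate_son(mean_father)

# The product of the two regression coefficients is r squared.
r_approx = math.sqrt(B_SON * B_FATHER)

print(round(estimate_son(70), 2))
print(round(mean_father, 2), round(mean_son, 2), round(r_approx, 3))
```

A father at the recovered mean of fathers is assigned the recovered mean of sons, which is the last point the student is asked to show.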


ACCURACY OF PREDICTION 

The next problem to which we turn is concerned with the accuracy
of prediction by means of a regression equation. It has
already been indicated that, when the mean of an array is used in
prediction, the error of estimate is a function of the spread within
that array. By introducing an assumption it becomes possible
to substitute one measure of error in place of the several, numerically
different, array standard deviations. An examination of
the array distributions in Table 15 reveals that the vertical arrays
differ from each other very little in dispersion (likewise, the horizontal
arrays). If we were to compute the standard deviations
for the vertical arrays, we would find differences, for this diagram, 
of such size as could readily be attributed to chance or sampling 
fluctuations; i.e., we assume that, if we had a much larger N, the 
array dispersions would be very nearly equal. Ordinarily this 
assumption of homoscedasticity can be met, and one measure of 
dispersion can be used for all the vertical arrays (and another for 
all the horizontal arrays). 

Error of estimate. One such measure might be an average of
the array $\sigma$'s, but to determine this we would need first to compute
all the $\sigma$'s, a somewhat laborious job. Since we are to use the
regression line, instead of array means, as a basis for prediction,
we really need something corresponding to the $\sigma$ about this line.
Such a value can be obtained by noting that $y - y'$ (or $Y - Y'$)
represents the discrepancy between estimated and observed values
and that $\sum (y - y')^2 / N$ is the mean of the squared deviations,
the root of which will be the standard deviation of the discrepancies
between estimated and observed values. This will be
taken as the one standard deviation to replace the several standard
deviations as our measure of the error of prediction. This
particular standard deviation, defined as the square root of
$\sum (y - y')^2 / N$, is called the standard error of estimate. It may be
determined in two ways. First we can take a roundabout way
which involves these steps: the prediction of each Y by use of
equation (32a), or each y by use of (32); the calculation of the
discrepancies $(Y - Y')$ or $(y - y')$; squaring, summing, dividing
by N, and taking the square root. A quicker method for determining
the standard error of estimate is readily derived algebraically.

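Let $\sigma_{y \cdot x}$ stand for the standard error of Y as estimated from
X; then by definition,

$$\sigma_{y \cdot x}^2 = \frac{\sum (Y - Y')^2}{N} = \frac{\sum (y - y')^2}{N}$$

but

$$y' = r\frac{\sigma_y}{\sigma_x}x$$

by formula (32), whence

$$\begin{aligned}
\sigma_{y \cdot x}^2 &= \frac{1}{N}\sum \left(y - r\frac{\sigma_y}{\sigma_x}x\right)^2 \\
&= \frac{1}{N}\sum \left(y^2 - 2r\frac{\sigma_y}{\sigma_x}xy + r^2\frac{\sigma_y^2}{\sigma_x^2}x^2\right) \\
&= \frac{\sum y^2}{N} - 2r\frac{\sigma_y}{\sigma_x}\left(\frac{\sum xy}{N}\right) + r^2\frac{\sigma_y^2}{\sigma_x^2}\left(\frac{\sum x^2}{N}\right) \\
&= \sigma_y^2 - 2r\frac{\sigma_y}{\sigma_x}\,r\sigma_x\sigma_y + r^2\frac{\sigma_y^2}{\sigma_x^2}\,\sigma_x^2 \\
&= \sigma_y^2 - r^2\sigma_y^2
\end{aligned}$$

then

$$\sigma_{y \cdot x} = \sigma_y\sqrt{1 - r^2} \tag{34}$$

By a similar line of reasoning it can be shown that

$$\sigma_{x \cdot y} = \sigma_x\sqrt{1 - r^2} \tag{35}$$

which gives the standard error of X as estimated from Y.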
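The agreement between the roundabout method and formula (34) can be checked numerically. The sketch below is a minimal illustration in Python; the six pairs of scores are hypothetical (they are not drawn from Table 15), and the helper names are assumptions. Standard deviations are computed with N in the denominator, as in the text.

```python
import math


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Standard deviation with N (not N - 1) in the denominator."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cross / (sd(xs) * sd(ys))


# Hypothetical paired scores (assumed for illustration only).
X = [1, 2, 3, 4, 5, 6]
Y = [2, 1, 4, 3, 6, 5]

r = pearson_r(X, Y)
slope = r * sd(Y) / sd(X)
constant = mean(Y) - slope * mean(X)

# Roundabout way: predict each Y, then take the root mean square discrepancy.
residuals = [y - (slope * x + constant) for x, y in zip(X, Y)]
se_roundabout = math.sqrt(sum(e ** 2 for e in residuals) / len(Y))

# Quicker way: formula (34).
se_shortcut = sd(Y) * math.sqrt(1 - r ** 2)

print(se_roundabout, se_shortcut)
```

The two printed values agree apart from floating-point rounding, which is what the algebra above leads one to expect.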

Thus the correlation coefficient not only enters into the prediction
equations (32 to 33a), but also permits us to gauge the accuracy
of prediction. It should be noted in passing that one can
write the equation of a best-fit line without first determining r
and that the error of prediction can also be ascertained without
recourse to r. Such a method for determining the error of estimate
has already been indicated: the square root of $\sum (Y - Y')^2 / N$,
in which $Y - Y'$ represents the computed discrepancy between
observed and predicted values. This need not involve r unless
the prediction equation is written in terms of r, as was done in
(32a). The equation Y' = A + BX can be written in the form

$$Y' = \frac{\sum X^2 \sum Y - \sum X \sum XY}{N\sum X^2 - (\sum X)^2} + \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}\,X \tag{36}$$

in which X and Y stand for gross or original measures. Formula
(36) for the best-fitting line (least squares solution) does not
involve means, $\sigma$'s, or the correlation coefficient. If, as is frequently
the case, one is interested in obtaining the equation for
Y only, it will be noticed that it is unnecessary to compute the
sum of the Y squares, which is not, however, a tremendous saving
of time. Perhaps the quickest way for determining the equation
is by direct substitution into (36), but the determination of the
error of estimate (sometimes called the goodness of fit of the line)
is certainly facilitated by calculating r and $\sigma_y$ and substituting in
(34).
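Direct substitution into formula (36) can be sketched in a few lines of Python. The function name and the six pairs of scores are assumptions made for illustration; only the raw sums named in the formula are used, so neither the means, the standard deviations, nor r is computed.

```python
def least_squares_line(xs, ys):
    """Constants A and B of Y' = A + B*X from raw score sums, per formula (36)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    a = (sum_xx * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Hypothetical paired scores (assumed for illustration only).
X = [1, 2, 3, 4, 5, 6]
Y = [2, 1, 4, 3, 6, 5]

A, B = least_squares_line(X, Y)
print(A, B)
```

As the text observes, the sum of the Y squares is never needed for this line.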

The standard error of estimate is to be interpreted as a standard
deviation, and in so doing we are tacitly assuming that the array
distributions are not only equal in dispersion but also normal.
For the correlation diagram in Table 15, we have $\sigma_{y \cdot x} = 1.9$,
which is to be considered the standard deviation of the Y values
about the regression line, Y' = .52X + 33.24. By use of this
equation we would predict that the height of the son of a man
70 inches tall (X = 70) would be 69.6, and the error of estimate,
1.9, would be interpreted by saying that, if we made many such
predictions, 68.26 times out of a hundred the actual height of sons
of 70-inch fathers would be within the limits 69.6 ± 1.9, and nearly
always within the limits 69.6 ± 3(1.9).

This is a second method for interpreting the correlation coefficient:
in terms of the accuracy of prediction or goodness of fit of
regression lines. If no correlation exists, the errors of estimate
are $\sigma_{y \cdot x} = \sigma_y$ and $\sigma_{x \cdot y} = \sigma_x$. In this connection it can be seen
from formulas (32a) and (33a) that, when r = 0, the estimated
Y, Y', becomes $M_y$, and X' becomes $M_x$. For example, if it has
been established that the correlation between toe length and IQ
is zero, we would always take 100 (the mean) as our best guess
for an individual's IQ regardless of toe length. The error of estimate
would of course be the standard deviation of the distribution
of IQ's, and it would be said that toe length is useless in predicting
IQ. The scatter diagram for IQ as Y and toe length as X
would exhibit the following characteristics: first, the regression
line Y' = A + BX would be horizontal, i.e., B would equal zero,
and the means of the arrays would fluctuate about the value
$M_y$, or A would equal $M_y$; and, second, all the array distributions
would have dispersions approximately equal to $\sigma_y$. What would
be the best guess as to the other regression line and the standard
deviations of the horizontal arrays?

Now suppose the correlation between the variables were perfect
(r = +1 or −1). The tallies in the scatter diagram would lie
in a line, there would be no spreading about this line, the two
regression lines would coincide, and no error would be involved
in estimating X from Y or Y from X. That $\sigma_{y \cdot x}$ and $\sigma_{x \cdot y}$ would
both be zero in case of perfect correlation is quite evident when
one considers formulas (34) and (35).

At this point the student should note the difference between
positive and negative correlation. In the case of a positive r, a
high score goes with high and low with low, whereas, for a negative
r, high goes with low and low with high. With reference
to the scatter diagram, a negative r typically involves a swarm
of tallies stretching from the upper-left to the lower-right corner,
whereas for a positive r the trend is from lower left to upper right
(this assumes that the axes have been laid off in the conventional
fashion). With reference to the regression equations, a negative
r yields negative regression coefficients or negative slope for the
lines. The student should be warned that an apparently negative
r may in reality be positive. Thus, if one variable is a test or
performance scored in terms of time (or errors) and the other
variable is scored in terms of amount done, the scatter diagram
might show large time scores as going with small amounts of
work done, i.e., high with low, which might be wrongly taken to
indicate negative rather than positive correlation. Instead of
asking whether high goes with high and low with low, it is safer
to ask whether best goes with best. This rule, however, is difficult
to apply when we are dealing with the interrelation of personality
traits, especially those which do not readily permit of a
statement as to which is the desirable end of the trait scale. The
sign of the correlation coefficient in such cases always needs a
qualifying statement which explicitly tells the direction of the
relationship between the variables. Obviously, as far as accuracy
of prediction is concerned, the error is the same for a negative and
positive r of the same magnitude.

Alienation. To return to the interpretation of the correlation
coefficient by way of the standard error of estimate, we see that
the factor in formulas (34) and (35) which involves r is $\sqrt{1 - r^2}$.
It is the value of this which, when multiplied by the proper $\sigma$,
leads to the error of estimate. The expression $\sqrt{1 - r^2}$ is called
the coefficient of alienation. If r is zero, its value is 1 and the
error of estimate is the $\sigma$ for the variable being estimated. Table
16 gives the value of the coefficient of alienation for varying values
of r. The student will do well to fix in mind the trend in this
table. It will be noted that, compared to a correlation of zero, an
r of .60 reduces the error of estimate by 20 per cent, whereas an
r of .30 reduces it by about 5 per cent; that r must be as high as
.866 before the error of estimate is reduced by one-half; and that
the difference in reduction between an r of .70 and an r of .90 is
approximately the same as that between .20 and .70. This interpretation
of r is most useful and at the same time most disturbing,
since the errors of estimate for r's in the vicinity of .40 to .70,
values usually found and utilized in predicting success from test
results, are discouragingly large.

Table 16. Values of the Coefficient of Alienation

   r     $\sqrt{1 - r^2}$       r     $\sqrt{1 - r^2}$
  .00        1.000             .60        .800
  .10         .995             .70        .714
  .20         .980             .80        .600
  .30         .954             .866       .500
  .40         .917             .90        .436
  .50         .866             .95        .312
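The entries of Table 16 can be regenerated with a few lines of Python, which also prints the percentage reduction in the error of estimate relative to a correlation of zero. This is only a sketch; the function name is an assumption.

```python
import math


def alienation(r):
    """Coefficient of alienation, the square root of 1 - r squared."""
    return math.sqrt(1.0 - r * r)


for r in (0.00, 0.10, 0.20, 0.30, 0.40, 0.50,
          0.60, 0.70, 0.80, 0.866, 0.90, 0.95):
    k = alienation(r)
    reduction = 100.0 * (1.0 - k)   # per cent reduction in error relative to r = 0
    print(f"{r:5.3f}  {k:5.3f}  {reduction:5.1f}%")
```

The last column makes the trend plain: the reduction in error grows slowly for low and moderate values of r and rapidly only as r approaches unity.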

A somewhat different way of grasping the meaning of r, as it is
applied to accuracy of prediction, is to square both sides of formula
(34) and then solve explicitly for r. This leads to

$$r = \sqrt{1 - \frac{\sigma_{y \cdot x}^2}{\sigma_y^2}}$$

from which it is readily seen that the correlation coefficient depends
upon the accuracy of prediction relative to the total variance of
the variable being predicted.

It might be well at this time to bring together a few remarks
concerning the assumptions involved in using and interpreting a
correlation coefficient in terms of either rate of change or accuracy
of prediction. When an r is reported, and no evidence to the contrary
is given, one has a right to expect that the assumptions of
linearity of regression and homoscedasticity have been met. The
interpretation of r as rate of change definitely assumes linearity,
and the interpretation in terms of the error of estimate definitely
assumes both linearity and homoscedasticity. In certain special
cases where the investigator is interested only in a one-way prediction,
say Y from X, and there is no likelihood of ever reversing
to predict X from Y, it will suffice if the regression of Y on X,
i.e., for predicting Y from X, be linear and the Y or vertical array
distributions be homoscedastic. The use of the correlation coefficient
in predicting performance from age may be cited as an instance
in which one need not worry about the possible nonlinear
regression of age on score or the lack of homoscedasticity about
this regression line.

The student may have observed that no assumptions have been 
made concerning the nature of the marginal distributions; the 
utilization of r does not assume normal distributions for the 
variables being correlated. The use of the standard error of estimate,
however, assumes normality of the array distributions. As
regards the possible effect of nonnormal marginal distributions,
experience shows that nonlinearity, lack of homoscedasticity, or
nonnormality of arrays may frequently be associated with skewness
in one or both of the marginal distributions.




Although there are adequate checks for linearity and homoscedasticity,
a careful scrutinization of the scatter diagram is usually
sufficient to warn one of violent departures from these assumptions.
Hollerith and other nonplotting schemes for computing
r give no inkling as to whether these assumptions are being violated
and therefore cannot command the confidence of the careful
investigator. The purpose of a research project might very well
be the study of the relationship between two variables, but an
end result in terms of a correlation coefficient, with no attention
given to the form of the relationship, is inadequate.

VARIANCE AND CORRELATION 

A third method of interpreting r is in terms of variance. Before
discussing this interpretation, we must introduce an important
theorem concerning the variance of a sum (or difference). Suppose
that variable W is made up of two parts U and V such that
W = U + V. For example, the score on an arithmetic test might
consist of two parts: score in addition and score in multiplication.
Obviously, w = u + v, and therefore the variance of the W variable
is

$$\begin{aligned}
\sigma_w^2 &= \frac{1}{N}\sum w^2 = \frac{1}{N}\sum (u + v)^2 \\
&= \frac{1}{N}\left(\sum u^2 + \sum v^2 + 2\sum uv\right) \\
&= \sigma_u^2 + \sigma_v^2 + 2r_{uv}\sigma_u\sigma_v
\end{aligned} \tag{37}$$

and in case U and V are independent, we have

$$\sigma_w^2 = \sigma_u^2 + \sigma_v^2 \tag{37a}$$

If we are dealing with the difference, W = U − V, we have

$$\sigma_w^2 = \sigma_u^2 + \sigma_v^2 - 2r_{uv}\sigma_u\sigma_v \tag{38}$$

and for U and V independent, we have

$$\sigma_w^2 = \sigma_u^2 + \sigma_v^2 \tag{38a}$$

which is identical with (37a). In words, the variance of a sum (or
difference) of two independent variables is equal to the sum of their
separate variances. Variances are additive, whereas standard
deviations are not. It can be shown that, when U and V are
distributed normally, their sum or difference will also yield a
normal distribution.
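The theorem for the variance of a sum and of a difference can be verified on a small set of scores. The following Python sketch uses hypothetical addition and multiplication scores for five pupils; the scores and helper names are assumptions made for illustration, and variances are computed with N in the denominator.

```python
import math


def mean(v):
    return sum(v) / len(v)


def variance(v):
    """Variance with N in the denominator."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


def correlation(a, b):
    ma, mb = mean(a), mean(b)
    cross = sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)
    return cross / math.sqrt(variance(a) * variance(b))


# Hypothetical part scores (assumed for illustration only).
U = [3, 5, 2, 8, 7]   # addition
V = [4, 4, 1, 6, 9]   # multiplication

r_uv = correlation(U, V)
s_u, s_v = math.sqrt(variance(U)), math.sqrt(variance(V))

var_sum = variance([u + v for u, v in zip(U, V)])
var_diff = variance([u - v for u, v in zip(U, V)])

by_37 = s_u ** 2 + s_v ** 2 + 2 * r_uv * s_u * s_v   # formula (37)
by_38 = s_u ** 2 + s_v ** 2 - 2 * r_uv * s_u * s_v   # formula (38)

print(var_sum, by_37)
print(var_diff, by_38)
```

Each printed pair agrees apart from rounding; the cross-product term drops out only when the two parts are uncorrelated.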

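Now, with regard to the third method for interpreting r, let us
note that in deviation units an observed y can be thought of as
made up of two independent parts, the part which can be predicted
from X, namely $y'$, and the residual or unpredictable part
which we designate as $z_{y \cdot x}$. Before going farther we must demonstrate
that $y'$ and $z_{y \cdot x}$ are really independent. The numerator
for the correlation between $y'$ and $z_{y \cdot x}$ can be expressed as
$\sum y' z_{y \cdot x}$. But, since $y' = r\frac{\sigma_y}{\sigma_x}x$ and $z_{y \cdot x} = y - r\frac{\sigma_y}{\sigma_x}x$, we have

$$\begin{aligned}
\sum y' z_{y \cdot x} &= \sum r\frac{\sigma_y}{\sigma_x}x\left(y - r\frac{\sigma_y}{\sigma_x}x\right) \\
&= r\frac{\sigma_y}{\sigma_x}\sum xy - r^2\frac{\sigma_y^2}{\sigma_x^2}\sum x^2 \\
&= r\frac{\sigma_y}{\sigma_x}\,Nr\sigma_x\sigma_y - r^2\frac{\sigma_y^2}{\sigma_x^2}\,N\sigma_x^2
\end{aligned}$$

which is seen to be zero; hence $y'$ and $z_{y \cdot x}$ are uncorrelated.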

We have $y = y' + z_{y \cdot x}$, whence, by the above variance theorem,

$$\sigma_y^2 = \sigma_{y'}^2 + \sigma_{y \cdot x}^2 \tag{39}$$

in which $\sigma_{y \cdot x}^2$ is the variance of the residuals, $z_{y \cdot x}$. If we divide
both sides of this equation by $\sigma_y^2$, we get

$$1 = \frac{\sigma_{y'}^2}{\sigma_y^2} + \frac{\sigma_{y \cdot x}^2}{\sigma_y^2} \tag{39a}$$

from which we see that, since the two ratios add to unity, either
one can be interpreted as a proportion (or a percentage by shifting
the decimal point). Thus the ratio of $\sigma_{y'}^2$ to $\sigma_y^2$ is the proportion
of the variance in Y which can be predicted from X, and the ratio
of $\sigma_{y \cdot x}^2$ to $\sigma_y^2$ represents the proportion of the variation (variance)
of Y which is left over or remains or cannot be predicted from X.



A little reflection as to the meaning of this residual variance should
convince the student that we are here dealing with the same
variance which results if we square formula (34), thus

$$\sigma_{y \cdot x}^2 = \sigma_y^2(1 - r^2)$$

which means that

$$\frac{\sigma_{y \cdot x}^2}{\sigma_y^2} = 1 - r^2$$

When we substitute this value into (39a), we have

$$1 = \frac{\sigma_{y'}^2}{\sigma_y^2} + 1 - r^2$$

from which it is readily seen that the ratio

$$\frac{\sigma_{y'}^2}{\sigma_y^2} = r^2$$

That is, the square of the correlation coefficient gives the proportion
of the total variance of Y which is predictable from X, or
measures the proportion of the Y variance which can be attributed
to variation in X. The proportion of the variance of Y which is
due to variables other than X is given by $1 - r^2$. By shifting
decimals, we can think of $r^2$ as indicating a percentage, the percentage
of variance which has been explained, and $1 - r^2$ as the
percentage of variance due to other causes. It will be noted that
$r^2$, not r, can be so interpreted. This is true because variances
are additive, whereas standard deviations are not. It should be
emphasized that $r^2$ as a proportion has to do with variation expressed
technically as variance.
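The division of the Y variance into a predictable and a residual part can be traced on a small example. The Python sketch below works in deviation scores with hypothetical data (the scores and names are assumptions made for illustration) and checks three things: that the sum of cross products of the predicted parts and the residuals vanishes, that the two variances add to the total as in (39), and that the predictable proportion equals r squared.

```python
import math


def mean(v):
    return sum(v) / len(v)


def variance(v):
    """Variance with N in the denominator."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


# Hypothetical paired scores (assumed for illustration only).
X = [1, 2, 3, 4, 5, 6]
Y = [2, 1, 4, 3, 6, 5]
n = len(X)

x = [a - mean(X) for a in X]   # deviation scores
y = [b - mean(Y) for b in Y]

r = sum(a * b for a, b in zip(x, y)) / (n * math.sqrt(variance(X) * variance(Y)))
slope = r * math.sqrt(variance(Y) / variance(X))

y_pred = [slope * a for a in x]               # predicted part, equation (32)
resid = [b - p for b, p in zip(y, y_pred)]    # residual part, y minus y'

cross_sum = sum(p * z for p, z in zip(y_pred, resid))
var_pred = variance(y_pred)
var_resid = variance(resid)

print(cross_sum)                            # zero apart from rounding error
print(variance(Y), var_pred + var_resid)    # equation (39)
print(var_pred / variance(Y), r ** 2)       # predictable proportion
```

With these scores r squared is about .69, so roughly 69 per cent of the Y variance is predictable from X and 31 per cent is residual.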

It is of some interest to examine the meaning of $\sigma_{y'}^2$. It is the
square of the standard deviation of the estimated values, and,
with reference to the scatter diagram, $\sigma_{y'}$ corresponds approximately
to what we would obtain if we were to compute the standard
deviation about $M_y$ of the vertical array means, each weighted
according to the number of cases in its array. As an exercise, the
student can prove, by determining $\sigma_{y'}^2$ directly rather
than by formula (34), that $\sigma_{y'}^2 = r^2\sigma_y^2$. (Hint: use the deviation
score form of the regression equation.)

This third method of interpreting a correlation coefficient
assumes linearity of the regression line involved in predicting Y,
or the dependent variable, from X as the independent variable;
i.e., the regression of Y on X must be linear. If X were considered
as the dependent variable, then the interpretation that
$r^2$ indicates the proportion of the variance of X explained by Y
would assume linearity for the regression of X on Y. The assumption
of linearity becomes explicit if one proves directly that
$\sigma_{y'}^2 = r^2\sigma_y^2$, and it was implied when we used $\sigma_{y \cdot x}^2$ in that this
residual variance was taken about a straight line. This interpretation
does not assume homoscedasticity, nor does it assume
normality either for the marginal or for the array distributions.

The investigator who is interested in analyzing variation and 
its possible causes will prefer the interpretation of the correlation 
coefficient in terms of variance. The problem is frequently one 
in which an attempt is made to explain variation in one trait in 
terms of variation of another which is conceived of as being more 
basic. The use of $r^2$ as the percentage of the variance of a trait
which is predictable by, or attributable to, variation in a second
variable becomes a valuable tool in the analysis of variation. Of
course one must use caution in assuming causation of one variable
by another. Logic, not statistical method, must be invoked to
determine whether a causal relationship exists, and the statistical
interpretation modified accordingly. Variation in X might cause
variation in Y, or vice versa, or variation in both X and Y might
be due to the influence of some other variable or variables.

To illustrate the interpretation of $r^2$ as a percentage, let us suppose
we have the performance of a group of school children on a
substitution test. Considerable variation in scores will be present,
and we may rightfully ask whether a portion of this variation
is due to age differences. We can determine the correlation between
age and performance. Suppose r = .60; this can be interpreted
by saying that 36 per cent of the variance in performance
is due to age differences, and 64 per cent is due to other causes.
Likewise, the variance in crop yield due to variation in rainfall 
can be determined; or the variance in the height of a group of men 
may be analyzed into two or more parts, one of which might be 
the portion due to variation in the heights of their fathers. 

CORRELATION AND COMMON ELEMENTS 

A fourth possible interpretation of the correlation coefficient
assumes that each of the two variables can be thought of as a
summation of a number of equally potent, equally likely, independent
elements, which can be either present or absent. Then
the degree of correlation is a function of the number of elements
common to the two variables. The general formula is

$$r = \frac{n_c}{\sqrt{n_x + n_c}\,\sqrt{n_y + n_c}} \tag{40}$$

in which $n_x$ equals the number of elements unique to X, $n_y$ the
number unique to Y, and $n_c$ the number common to both variables.
If the number of elements in X equals the number in Y, r gives
the percentage of elements common to X and Y; if X is determined
only by elements common to Y, while Y has additional
elements, $r^2$ gives the percentage of elements entering into Y
which determine X. There is little, if any, factual basis for believing
that the assumptions stated above are tenable so far as psychological
variables are concerned, and therefore the interpretation
of the correlation coefficient in terms of common elements may
be viewed with scepticism.
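The two special cases mentioned for formula (40) are easily confirmed by substitution. The Python sketch below is a minimal illustration; the function name and the element counts are assumptions chosen only to make the arithmetic simple.

```python
import math


def common_elements_r(n_x, n_y, n_c):
    """Formula (40): n_x and n_y count unique elements, n_c the common ones."""
    return n_c / math.sqrt((n_x + n_c) * (n_y + n_c))


# Equal numbers of elements in X and Y (6 unique + 4 common in each):
# r is the proportion of each variable's elements that are common.
r_equal = common_elements_r(6, 6, 4)

# X made up only of the 4 common elements, Y having 12 additional ones:
# r squared is the proportion of Y's elements that determine X.
r_nested = common_elements_r(0, 12, 4)

print(r_equal, r_nested ** 2)
```

In the first case 4 of 10 elements are common and r is .40; in the second, 4 of the 16 elements in Y determine X and r squared is .25.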


NORMAL CORRELATION 

A fifth interpretation of r is more mathematical but of little
practical value. We have already seen how a frequency distribution
and its polygon can be thought of as smooth, conforming
perhaps to the equation of the normal curve. A correlation table
is a frequency distribution, a picture or graph of which requires a
third dimension. If we were to replace each tally in a scatter
diagram by a thin block, there would result something analogous
to the histogram except that it would be three dimensional: the
heights of the stacks of blocks would indicate the frequencies for
the various cells. Now suppose that this mound of blocks is by
some method smoothed to a surface, and we consider the total
volume under the surface (between the surface and the XY plane)
as representing N. Then the number of cases falling between two
given X values and simultaneously between two given Y values
will be approximately the volume of that portion of the mound
which has as its base the rectangle or square formed by the intersections
of the two X and two Y values. If the regression lines
are linear, if the array distributions are normal and homoscedastic,
and if the marginal distributions are normal, the resulting surface
is termed the normal correlation surface, and the equation of the
surface can be written as

$$z = \frac{N}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}\; e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)} \tag{41}$$

A number of important properties of the normal correlation surface
can be deduced from this equation and its integral. For instance,
the standard error of estimate can be derived from formula
(41), and it can also be shown that the contour lines which represent
different altitudes on the mound, i.e., different frequencies,
will be concentric ellipses, and that if r = 0, the contour lines will
become concentric circles. If the equation is written with N equal
to unity, by double integration the probability of an individual's
falling between two particular Y values and between two X
values can be determined. Tables are available which can be
utilized for this purpose.†
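In place of the published tables, the double integration can be approximated numerically. The following Python sketch evaluates formula (41) for deviation scores x and y and sums the surface over a rectangle by the midpoint rule. The function names, the choice r = .50 with unit standard deviations, and the number of steps are assumptions made for illustration; the result is an approximation, not an exact integral.

```python
import math


def normal_surface(x, y, r, sd_x, sd_y, n=1.0):
    """Height of the normal correlation surface, formula (41)."""
    k = n / (2.0 * math.pi * sd_x * sd_y * math.sqrt(1.0 - r * r))
    q = (x / sd_x) ** 2 + (y / sd_y) ** 2 - 2.0 * r * x * y / (sd_x * sd_y)
    return k * math.exp(-q / (2.0 * (1.0 - r * r)))


def volume(x_lo, x_hi, y_lo, y_hi, r, sd_x, sd_y, n=1.0, steps=200):
    """Midpoint-rule approximation to the volume over a rectangle."""
    dx = (x_hi - x_lo) / steps
    dy = (y_hi - y_lo) / steps
    total = 0.0
    for i in range(steps):
        x = x_lo + (i + 0.5) * dx
        for j in range(steps):
            y = y_lo + (j + 0.5) * dy
            total += normal_surface(x, y, r, sd_x, sd_y, n)
    return total * dx * dy


# With N = 1 the volume over (practically) the whole plane is close to unity.
whole = volume(-6, 6, -6, 6, r=0.5, sd_x=1.0, sd_y=1.0)

# Probability of falling above the mean on both variables when r = .50.
both_above = volume(0, 6, 0, 6, r=0.5, sd_x=1.0, sd_y=1.0)

print(whole, both_above)
```

With r = 0 and equal standard deviations the exponent depends only on the distance from the origin, which is the circular contour case mentioned above.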


LIMITS FOR r 

Attention is called to the fact that definition formula (29) becomes
$r = \sum z_x z_y / N$, when written in terms of standard scores
for both variables. This indicates specifically that the correlation
coefficient is a statistical average, the average of the cross
products of standard scores. Suppose that we ask what happens
when the correlation is perfect in the sense that each individual's
$z_x$ score equals his $z_y$ score. If this is true, the sum $\sum z_x z_y$ would
be the same as $\sum z^2$, which when divided by N gives 1.00. Thus
the upper limit for r is +1.00. Now suppose a perfect inverse
relationship, such that an individual's $z_x$ and $z_y$ are the same
except for sign, one being positive whereas the other is negative.
If this holds true for all the cases, the sum $\sum z_x z_y$ can be written as
$\sum z(-z)$ or $-\sum z^2$, which when divided by N gives −1.00 as the
limit for perfect negative correlation.

As exercises, the student should show that multiplying or dividing
either X or Y or both by a constant, or X by one constant and
Y by another, will not change r, and that adding or subtracting a
constant does not affect the value of r.
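These properties can be checked numerically before they are proved. The Python sketch below computes r as the mean cross product of standard scores and then repeats the calculation after rescaling and shifting the scores. The data and function names are assumptions made for illustration, and the multipliers used are positive constants.

```python
import math


def standard_scores(v):
    """Deviation scores divided by the standard deviation (N in the denominator)."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))
    return [(a - m) / s for a in v]


def r_from_z(xs, ys):
    """Correlation as the average cross product of standard scores."""
    zx, zy = standard_scores(xs), standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical paired scores (assumed for illustration only).
X = [1, 2, 3, 4, 5, 6]
Y = [2, 1, 4, 3, 6, 5]

r = r_from_z(X, Y)

# Multiply X by 3 and add 10; divide Y by 2 and subtract 4.
r_transformed = r_from_z([3 * x + 10 for x in X], [y / 2 - 4 for y in Y])

# Identical standard scores give +1; standard scores equal but opposite in sign give -1.
r_upper = r_from_z(X, [2 * x + 1 for x in X])
r_lower = r_from_z(X, [-x for x in X])

print(r, r_transformed, r_upper, r_lower)
```

The first two printed values agree, and the last two are the limits +1.00 and −1.00 apart from rounding.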

† Pearson, Karl, Tables for statisticians and biometricians, part II, Cambridge:
Cambridge University Press, 1931. See Tables 8 and 9.




SUMMARY 

The five suggested methods for interpreting the correlation 
coefficient may be briefly summarized here. 

1. r is associated with the rate at which one variable changes 
with another. This assumes that the regression line so interpreted 
is linear. 

2. r tells us how accurately we can predict by a regression equation.
The standard error of estimate permits one to infer the
possible magnitude of the prediction error, whereas the coefficient 
of alienation indicates the reduction in error over that error which 
would exist if there were no correlation. This interpretation 
assumes that the regression line used in predicting is linear and 
that variation about this line is normal and homoscedastic. 

3. $r^2$ gives the proportion of variance in Y predictable from,
or attributable to, variation in X. This assumes linearity for the
regression of Y on X and requires caution in assuming the direction
of cause and effect.

The student should attempt to visualize the meaning of these
three principal methods of interpreting correlation. In particular,
he should note the meaning of $\sigma_y$, $\sigma_{y'}$, and $\sigma_{y \cdot x}$ (or their counterparts
with the subscripts y and x interchanged). The first, $\sigma_y$,
holds for the marginal distribution of all Y's; $\sigma_{y'}$ pertains to the
variability of all Y values as predicted from X; the third, $\sigma_{y \cdot x}$, is
a measure of the variation about the regression line for predicting
Y from X.

4. r or $r^2$ can be interpreted in terms of the percentage of elements
common to the two variables provided we are willing to
make rather hazardous and unrealistic assumptions as regards
the nature of the variables.

5. r can be interpreted mathematically in terms of the equation 
for the normal correlation surface. This assumes that both regressions
are linear, that homoscedasticity and normality hold for
both the horizontal and vertical array distributions, and that 
both marginal distributions are normal in form. 

The nature of the investigation will usually dictate or suggest 
the appropriate interpretation. Ordinarily the fifth will not be 
used in connection with the application of the correlational method, 
whereas the fourth rests on assumptions which can seldom be met. 



CHAPTER 8 


Factors Which Affect the Correlation Coefficient


Before we interpret, or draw conclusions from, a particular
correlation coefficient, it is necessary that we ask ourselves, What
factors might have affected its magnitude? The size of an obtained
r depends upon several specific conditions, and, even though
it is not always essential that corrections be applied, the investigator
must forever be on the lookout for correlations which deviate
from their "true" value because of the operation of disturbers.
This chapter will be devoted to a discussion of the more common
factors which influence r.

It is assumed that errors in computation have not been permitted,
that all arithmetical work has been checked. It is also
assumed that sufficient intervals have been used so as to make
unnecessary the application of Sheppard's correction for grouping;
if more than twelve intervals have been used, the slight increase
in r which results from correcting the standard deviations will be
negligible. Certain textbooks have advocated a correction to r
for smallness of the sample, which correction reduces r by a negligible
amount. In view of the magnitude of the effects of other
factors on r, these two possible corrections seem trifling.

SELECTION 

One of the first questions which must be faced is: Do the cases
upon which r is based represent a random sampling of some defined
population, or have selective factors so operated as to increase
or decrease r? The literature of psychology is not free from
correlation coefficients which are decidedly different from values
that would have been obtained had the sampling been random.
This is not to say that any investigator has willfully selected his
cases so as to produce correlation, but rather to say that unwitting
errors are frequently present in spite of an effort to avoid selective
factors.

SAMPLING ERRORS 

Even though one feels reasonably sure of the randomness of the
sample upon which an r is based, it is still necessary to consider
the obtained r in terms of variable errors due to sampling. Any r
based on N pairs of observations will differ more or less from the
universe or population value, which is here conceived of as the
value of the correlation coefficient which we would obtain if we
had an infinitely large sample. Many of the older texts gave
$(1 - r^2)/\sqrt{N}$ as the standard error of r, but failed to point out a
serious limitation as regards interpretation: that this is an approximation
and that r's for successive samples are not distributed
normally unless N is large and/or the universe value is near
zero.

Before further discussion it should be said that some measure
of the sampling fluctuation of the correlation coefficient is highly
desirable for any of three reasons: (1) We may wish to say whether
an obtained r can be taken as representing a real, nonchance,
correlation, i.e., whether it deviates sufficiently far from zero so
that we cannot regard it as a chance fluctuation from no relationship;
(2) we may wonder whether a given r deviates significantly
from some a priori or expected value; or (3) we may raise the question
of whether two obtained r's are significantly different from
each other. The answers to these questions must be in terms of
probability, and the probability figure which we accept as indicating
significance determines the confidence with which we regard
any such conclusions as we set forth.

If N is greater than 30, and if we are interested in saying whether 
or not an r (of .50 or less, usually) is significantly different from 
zero, we can determine its standard error by



and then divide the obtained r by this standard error in order to
secure an $x/\sigma$ value with which to enter the normal probability
table. If $r/\sigma_r$ is greater than 2.58, we can conclude with a fairly
high degree of confidence that the true or universe value for r is
likely to be greater than zero. For N less than 30, it is necessary
to follow a different procedure; this will be discussed in Chapter 12.

Formulas for the standard error of r, when the universe value is
large, are misleading because for high universe values the distribution
of successive sample values is markedly skewed. This skewness becomes
noticeable when the universe value reaches .40 or .50 and increases
rapidly as it nears unity. The skewness is also a function of N. Because of
this skewness the standard error of r loses its meaning; it cannot
be expected to yield a trustworthy answer as to whether an obtained
r deviates significantly from some a priori value, nor can
the significance of the difference between two r's be determined
by substituting in the ordinary formula for the standard error of
a difference.

The r to z transformation. Professor R. A. Fisher has 
developed a very useful and accurate technique for handling sampling 
errors for high values of r. This procedure is also applicable 
for low r's and can be used when N is large or small. He employs 
a transformation 


$$z = \tfrac{1}{2}\log_e (1 + r) - \tfrac{1}{2}\log_e (1 - r) \tag{43}$$

or 

$$z = 1.1513 \log_{10} \frac{1 + r}{1 - r} \tag{43a}$$


which has two distinct advantages: (1) the distribution of z for 
successive samples is independent of the universe value, i.e., 
for a given N the sampling distribution will have the same dispersion 
for all universe values; (2) the distribution of z for successive 
samples is so nearly normal that it can be treated as such with 
very little loss of accuracy. The standard error of z is 


$$\sigma_z = \frac{1}{\sqrt{N - 3}} \tag{44}$$


If we wish to state confidence limits for the universe value, we 
transform the obtained r to z by formula (43a) or by Table B of the 
Appendix, determine $\sigma_z$, find $z + 2.58\sigma_z$ and $z - 2.58\sigma_z$, 
and then transform these two z values back to r's by using Table C. 
As an example, and in contrast to the less exact procedure of taking 
$r \pm 2.58\sigma_r$, where $\sigma_r = (1 - r^2)/\sqrt{N}$, let us suppose 
an r of .90 based on an N of 50. The standard error of r by the usual 
formula is .027, whence .90 ± (2.58)(.027) yields the values .830 and 
.970 as confidence limits for the universe value. Now, if we utilize 
the z transformation, we find z = 1.47, and $\sigma_z$ = .146, whence 
1.47 ± (2.58)(.146) gives 1.093 and 1.847. These two values are then 
transformed back to the two r values, .798 and .951, which it will be 
noted differ from the confidence limits for the universe value as 
determined by the classical method. 
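The worked example above can be retraced numerically. The following is a minimal sketch, assuming that the inverse of formula (43) is the hyperbolic tangent (which is what Table C tabulates); the function names are illustrative and not from the text. Because the text rounds z and its standard error before multiplying, the unrounded limits differ from the printed ones in the third decimal.

```python
import math

def fisher_z(r):
    """Formula (43): z = 1/2 ln(1 + r) - 1/2 ln(1 - r)."""
    return 0.5 * math.log(1 + r) - 0.5 * math.log(1 - r)

def z_to_r(z):
    """Inverse of formula (43); plays the role of Table C."""
    return math.tanh(z)

def z_limits(r, n, k=2.58):
    """Confidence limits for the universe value via the z transformation."""
    z = fisher_z(r)
    se_z = 1 / math.sqrt(n - 3)          # formula (44)
    return z_to_r(z - k * se_z), z_to_r(z + k * se_z)

def classical_limits(r, n, k=2.58):
    """Less exact limits: r +/- k * (1 - r^2) / sqrt(N)."""
    se_r = (1 - r ** 2) / math.sqrt(n)
    return r - k * se_r, r + k * se_r

z_low, z_high = z_limits(0.90, 50)           # close to .798 and .951
c_low, c_high = classical_limits(0.90, 50)   # close to .830 and .970
print(round(z_low, 3), round(z_high, 3))
print(round(c_low, 3), round(c_high, 3))
```

Note that the z-based limits are asymmetric about .90, which reflects the skewness of the sampling distribution of r that the classical limits ignore.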

Difference between r's. If we wish to determine the significance 
of the difference between two r's, both are transformed into 
z's, and the standard error of the difference between the two z's 
is obtained by 

$$\sigma_{D_z} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}} \tag{45}$$

and then the ratio of the difference to its standard error is treated 
in the usual manner. If the z's are significantly different, we conclude 
that the two r's are significantly different. 
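A short sketch of this comparison for two independent samples follows. It assumes that the standard error of the difference is the square root of the sum of the two squared standard errors given by formula (44); the correlations and sample sizes in the usage line are hypothetical.

```python
import math

def z_difference_ratio(r1, n1, r2, n2):
    """Ratio of the difference between two z's to its standard error.

    Assumes independent samples, so the variance of the difference is
    the sum of the two variances 1/(N1 - 3) and 1/(N2 - 3).
    """
    z1 = math.atanh(r1)   # same value as formula (43)
    z2 = math.atanh(r2)
    se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se_diff

# Hypothetical example: r = .80 and r = .60, each based on N = 103.
ratio = z_difference_ratio(0.80, 103, 0.60, 103)
print(round(ratio, 2))
```

The resulting ratio is entered in the normal probability table in the same way as any other critical ratio.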

The problem of obtaining a measure of the significance of the 
difference between two r's, say $r_{12}$ and $r_{13}$ or $r_{12}$ and $r_{34}$, which 
are based on the same group is complicated in that for successive 
samples there will be a relationship between $r_{12}$ and $r_{13}$ and may 
be a relationship between $r_{12}$ and $r_{34}$. This situation frequently 
arises in practice, and an approximation to the standard error of 
the difference is obtainable by substituting the somewhat meaningless 
standard errors of the two r's and an expression which represents 
the relationship between them into the standard error of 
the difference formula, which would then read 

$$\sigma_{D_r} = \sqrt{\sigma_r^2 + \sigma_{r'}^2 - 2 r_{rr}\sigma_r\sigma_{r'}} \tag{46}$$

where $\sigma_r = (1 - r^2)/\sqrt{N}$ and $r_{rr}$ is the correlation between the 
two correlation coefficients. This correlation for comparing $r_{12}$ 
and $r_{13}$ is given by 




and for comparing $r_{12}$ and $r_{34}$ we obtain this correlation by 

$$r_{r_{12}r_{34}} = \frac{1}{2(1 - r_{12}^2)(1 - r_{34}^2)}\Big[(r_{13} - r_{12}r_{23})(r_{24} - r_{23}r_{34}) + (r_{14} - r_{13}r_{34})(r_{23} - r_{12}r_{13}) + (r_{13} - r_{14}r_{34})(r_{24} - r_{12}r_{14}) + (r_{14} - r_{12}r_{24})(r_{23} - r_{24}r_{34})\Big] \tag{48}$$

It is necessary to use the value obtained by formula (47) in 
formula (46) when the correlation between variables 1 and 2 is 
being compared with that between 1 and 3 when both coefficients 
are based on the same sample. Likewise formula (48) must be 
used when the difference between $r_{12}$ and $r_{34}$ is being evaluated 
provided the two r's are based on the same sample. 

However, this procedure will be grossly in error for determining 
the significance of the difference between high r's. What is needed 
is the correlation between z's to enable us to include a correlational 
term in formula (45). For the situation which involves 
the same sample, $N_1 = N_2 = N$, and the formula for the $\sigma$ of 
the difference between correlated z's becomes 

$$\sigma_{D_z} = \sqrt{\frac{2 - 2r_{zz}}{N - 3}} \tag{49}$$

Presumably $r_{zz}$ will be equal to $r_{rr}$ as obtained by formula (47) 
or (48). The use of (49) should yield more meaningful results 
than (46) and is therefore recommended as a substitute for formula 
(46), which involves the untenable assumption that correlation 
coefficients from successive samplings are distributed normally. 
Formulas (47) and (48) provide the only available way 
for estimating the needed $r_{rr}$ or $r_{zz}$. Little is known about the 
approximation error entailed in their usage. 
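As a rough illustration of how formulas (48) and (49) fit together, the sketch below computes the correlation between $r_{12}$ and $r_{34}$ from the six intercorrelations and then the standard error of the difference between the corresponding z's. The term-by-term form of (48) used here is the one printed above; the function names are illustrative only, and the substitution of $r_{rr}$ for $r_{zz}$ is the presumption stated in the text.

```python
import math

def corr_r12_r34(r12, r13, r14, r23, r24, r34):
    """Formula (48): correlation between r12 and r34 over successive samples."""
    bracket = ((r13 - r12 * r23) * (r24 - r23 * r34)
               + (r14 - r13 * r34) * (r23 - r12 * r13)
               + (r13 - r14 * r34) * (r24 - r12 * r14)
               + (r14 - r12 * r24) * (r23 - r24 * r34))
    return bracket / (2 * (1 - r12 ** 2) * (1 - r34 ** 2))

def se_diff_correlated_z(r_zz, n):
    """Formula (49): standard error of the difference between correlated z's."""
    return math.sqrt((2 - 2 * r_zz) / (n - 3))

# If variables 3 and 4 are unrelated to variables 1 and 2, the two r's
# are uncorrelated and (49) reduces to formula (45) with N1 = N2 = N.
r_rr = corr_r12_r34(0.60, 0.0, 0.0, 0.0, 0.0, 0.40)
print(r_rr, se_diff_correlated_z(r_rr, 53))
```

A positive $r_{zz}$ shrinks the standard error, so ignoring the correlation between two r's from the same sample will usually understate the significance of their difference.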


RANGE OR SPREAD OF TALENT 

The magnitude of the correlation coefficient varies with the 
degree of heterogeneity (with respect to the traits being corre¬ 
lated) of the sample. If we are drawing a sample from a group 
which is restricted in range with regard to either or both variables, 
the correlation will be relatively low. Thus the restricted range 
of intelligence is one factor which leads to lower correlation between 
intelligence and grades for college students than that usually 
found for high school groups. If the range with respect to 
one variable has been curtailed, and one knows the standard 
deviation for an uncurtailed distribution, it is possible to adjust 
the correlation for the difference in range, provided one can be 
sure of the tenability of two assumptions: that the regressions are 
linear and that the arrays are homoscedastic for the scatter based 
on the uncurtailed distribution. If the curtailment is in variable 
X, and we let 


$\sigma_x$ = SD for curtailed distribution, 

$S_x$ = SD for uncurtailed distribution, 

$r_{xy}$ = correlation of variable Y with X for curtailed range, 

$R_{xy}$ = correlation of variable Y with X for uncurtailed range, 

the relationship by which we would predict $R_{xy}$ from $\sigma_x$, $S_x$, and 
$r_{xy}$ is given by 

$$R_{xy} = \frac{r_{xy}(S_x/\sigma_x)}{\sqrt{1 - r_{xy}^2 + r_{xy}^2(S_x^2/\sigma_x^2)}} \tag{50}$$


Obviously, if we have R instead of r, the value of r for a restricted 
range can be estimated by formula (50). All we need to do is 
interchange S and $\sigma$, R and r, and then substitute to find r. The 
estimation of r need not be made in ignorance of whether the 
assumptions of linearity and homoscedasticity can be met; an 
examination of the accessible scatter for the uncurtailed range 
will reveal the facts. 
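Formula (50) and its inverse can be sketched as a pair of functions. The names below are illustrative, and the inverse is obtained exactly as the text describes, by interchanging the two standard deviations and the two correlations. The first usage line reproduces the entry of Table 17 for an uncurtailed correlation of .30 and a ratio of curtailed to uncurtailed SD of .90.

```python
import math

def uncurtailed_r(r_xy, sd_curtailed, sd_uncurtailed):
    """Formula (50): estimate R_xy from r_xy and the two standard deviations."""
    k = sd_uncurtailed / sd_curtailed
    return r_xy * k / math.sqrt(1 - r_xy ** 2 + (r_xy * k) ** 2)

def curtailed_r(R_xy, sd_curtailed, sd_uncurtailed):
    """Formula (50) with S and sigma, R and r interchanged."""
    k = sd_curtailed / sd_uncurtailed
    return R_xy * k / math.sqrt(1 - R_xy ** 2 + (R_xy * k) ** 2)

# R = .30 with the SD cut to nine-tenths of its uncurtailed value.
print(round(curtailed_r(0.30, 0.9, 1.0), 3))

# Applying the correction to the curtailed value recovers the original R.
print(round(uncurtailed_r(curtailed_r(0.60, 0.7, 1.0), 0.7, 1.0), 3))
```

Both functions rest on the assumptions named above, linear regression and homoscedastic arrays in the uncurtailed scatter; nothing in the arithmetic checks them.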


Table 17. Values for $r_{xy}$ for $R_{xy}$ of .30, .40, ... .80 with $\sigma_x/S_x$ Values 
of .90, .80, ... .50 

                            R_xy
  σ_x/S_x    .30    .40    .50    .60    .70    .80

    .90     .272   .366   .461   .559   .662   .768
    .80     .244   .330   .419   .514   .617   .730
    .70     .215   .292   .375   .465   .566   .682
    .60     .185   .253   .327   .410   .507   .625
    .50     .155   .213   .277   .351   .440   .555


Formula (50) indicates definitely that the magnitude of the 
correlation coefficient is a function of the degree of heterogeneity 
with respect to one of the traits being correlated. A better appreciation 
of the extent of this influence can be had by examining 
Table 17 which gives, for varying values of $R_{xy}$ along the top and 
different $\sigma_x/S_x$ ratios along the left, the corresponding values of 
$r_{xy}$. It can be shown that double selection, i.e., curtailment 
on both variables, tends to depress the correlation coefficient. 
Since the formulas for “correcting” for double curtailment are 
not too satisfactory, none is given here. 

One important rule emerges from the foregoing: standard deviations 
should always be reported along with correlation coefficients, 
and some indication should be given as to variation typically 
found for the variables. 

EFFECT OF UNRELIABILITY 

Before considering the effect of unreliability, or errors of measurement, 
upon the correlation between two variables, it is necessary 
that we digress to explain briefly what is meant by reliability. If 
we were assigned the task of determining the height of an individual 
by the use of a tape measure, we might be satisfied with 
one measurement, but unfortunately a single determination might 
not be entirely free from error. To overcome this, two or more 
measures are averaged on the assumption that the chance or 
variable errors will more or less cancel out. If one computes the 
standard deviation of the distribution of several measurements 
(of the same thing), a summary figure indicating the possible 
magnitude of the variable errors will be obtained. This $\sigma$ neither 
pertains to nor measures the magnitude of a possible constant 
error, i.e., an error which affects all the measurements in the same 
direction. We are here concerned only with the magnitude of 
variable errors, or inaccuracies in measurement which are of a 
chance nature. 

Reliability. If we had the problem of determining the error in 
the measurement of height, we could make several measurements 
on one person and compute a measure of accuracy, or we might 
make just two measures on each of several persons and take the 
mean or median difference between the two measurements for all 
N individuals as our gauge of accuracy. Either scheme leads to 
an estimate of the size of the variable errors that may be involved. 

In psychological measurement, it is not always feasible or possible 
to obtain more than two measures on an individual for a 
given trait; hence it is necessary to use the second-mentioned 
scheme for determining the accuracy of measurement. The mean 
or median absolute error may suffice, but, as in physical measurement, 
we sometimes need to know the extent of the variable errors 
in relation to the magnitude of the thing being measured, i.e., the 
relative or percentage error. Psychologists have found it useful 
to interpret variable errors, not with regard to the magnitude (a 
nearly meaningless word in psychological tests) of the measures, 
but relative to the variability of the trait for a specific group of 
individuals. The correlation between two determinations is, as 
we shall soon see, one method of expressing the accuracy of measurement 
relative to the trait dispersion. Such a correlation is 
termed the reliability coefficient. 

Suppose X = an obtained score or measure for an individual, 

$X_\infty$ = his true score, 

e = a variable error, positive or negative. 

Then we can consider that 

$$X = X_\infty + e$$

or in deviation units 

$$x = x_\infty + e$$

The variance of the obtained scores will be 

$$\sigma_x^2 = \sigma_\infty^2 + \sigma_e^2 \tag{51}$$

providing we can assume $x_\infty$ and e uncorrelated. This assumption 
seems reasonable since the variable error, e, is supposed to be 
a chance affair, as often positive as negative, and therefore its 
magnitude and direction should not be related to anything else. 
Equation (51) can be stated in words: the variance of the distribution 
of scores can be broken up into two portions, the variance 
of the true scores and the variance due to errors of measurement. 

Suppose that for a given trait we have two measurements, 
each of which is in error but not necessarily to the same extent 
or in the same direction. Symbolically, 

$$x_1 = x_\infty + e_1$$
$$x_2 = x_\infty + e_2$$

in which the e's represent the errors which go with the two obtained 
scores. The reliability coefficient is defined as the correlation 
between two comparable measures of the same thing, i.e., 
the correlation between $x_1$ and $x_2$. (Each measured individual 
will have an $x_1$ and an $x_2$ score.) Thus we have the reliability 
coefficient, 


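$$r_{11} = r_{x_1x_2} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} = \frac{\sum (x_\infty + e_1)(x_\infty + e_2)}{N\sigma_1\sigma_2}$$

$$= \frac{\sum x_\infty^2 + \sum x_\infty e_2 + \sum x_\infty e_1 + \sum e_1e_2}{N\sigma_1\sigma_2}$$

Dividing by N gives 

$$r_{11} = \frac{\sigma_\infty^2 + r_{\infty e_2}\sigma_\infty\sigma_{e_2} + r_{\infty e_1}\sigma_\infty\sigma_{e_1} + r_{e_1e_2}\sigma_{e_1}\sigma_{e_2}}{\sigma_1\sigma_2} \tag{52}$$

If we assume all three r's in the numerator equal to zero, we have 

$$r_{11} = \frac{\sigma_\infty^2}{\sigma_1\sigma_2}$$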


It is assumed that we are correlating comparable measures of the 
same thing or trait, comparable in the sense that $\sigma_{e_1} = \sigma_{e_2}$, and 
$\sigma_1 = \sigma_2$. (The same trait is implied in that $x_1$ and $x_2$ are measures 
of $x_\infty$.) Whence we have 

$$r_{11} = \frac{\sigma_\infty^2}{\sigma_x^2} \tag{52a}$$

where $\sigma_x = \sigma_1 = \sigma_2$. The reliability coefficient can be interpreted 
as a proportion or percentage, since from formula (51) we 
have 

$$\frac{\sigma_\infty^2}{\sigma_x^2} + \frac{\sigma_e^2}{\sigma_x^2} = 1$$

i.e., the reliability coefficient represents the proportion of the 
variance of the obtained scores which is due to the variance of 
the true scores. It follows that $1 - r_{11}$ gives the proportion of the 
variance which is due to errors of measurement. 
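The interpretation of the reliability coefficient as a proportion of variance can be checked with a small simulation. This is only a sketch under assumed conditions: normally distributed true scores with a standard deviation of 4, independent normal errors with a standard deviation of 3 on each of two comparable forms, and an arbitrary random seed. None of these figures comes from the text.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1949)                      # arbitrary, for repeatability
N = 20000
SD_TRUE, SD_ERR = 4.0, 3.0             # assumed true-score and error sigmas

true_scores = [random.gauss(0, SD_TRUE) for _ in range(N)]
form_1 = [t + random.gauss(0, SD_ERR) for t in true_scores]   # x1 = true + e1
form_2 = [t + random.gauss(0, SD_ERR) for t in true_scores]   # x2 = true + e2

r11 = pearson_r(form_1, form_2)
expected = SD_TRUE ** 2 / (SD_TRUE ** 2 + SD_ERR ** 2)   # true variance / obtained variance
print(round(r11, 3), round(expected, 3))
```

With uncorrelated errors the obtained correlation between the two forms should fall close to the ratio of true variance to obtained variance, here 16/25.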

Obviously, the reliability coefficient can, by substitution from 
formula (52a) into the above expression, also be written as 

$$r_{11} = 1 - \frac{\sigma_e^2}{\sigma_x^2} \tag{53}$$



which indicates clearly that the reliability coefficient is a function 
of the magnitude of the variable error relative to the variability of 
the trait in question. It also follows from formula (53) that the 
error of measurement can be stated in terms of the reliability 
coefficient and $\sigma_x$, thus, 

$$\sigma_e = \sigma_x\sqrt{1 - r_{11}} \tag{54}$$

The reliability coefficient itself denotes the relative accuracy of 
measurement, and by (54) we can ascertain the absolute magnitude 
of the errors. 

That $\sigma_e$ is to be interpreted as the standard error of measurement 
may be clarified if we note that, when x ($x_1$ or $x_2$) is taken as 
evidence of the true score, $x - x_\infty$ becomes the error, and the 
standard deviation of such errors will be $\sigma_e$, as can be shown by 
easy algebra (an exercise). If it were possible to secure a large 
number of measures on an individual, we would expect these 
measures to distribute themselves normally about the true score 
with a standard deviation corresponding to $\sigma_e$. Thus, if the result 
of one testing yields an IQ of 80, and if $\sigma_e = 3$, we can conclude 
with high confidence that the individual's true IQ is somewhere 
between 71 and 89 ($80 \pm 3\sigma_e$), and with fair confidence that it is 
somewhere between 74 and 86 ($80 \pm 2\sigma_e$). It can readily be seen 
that the error of measurement expressed as a $\sigma$ has a distinct advantage 
over such concepts as the mean or median error, or the mean or 
median difference between two measurements, in that $\sigma_e$ enables 
us to use the probability table either in establishing confidence 
limits for a true score or in determining whether the scores of two 
individuals differ more than is to be expected on the basis of chance. 
The standard error of the difference between two obtained scores 
is $\sigma_e\sqrt{2}$. 
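The arithmetic of the IQ example can be laid out in a few lines. In this sketch the standard deviation of 15 and the reliability of .96 are hypothetical values chosen only because formula (54) then gives a standard error of measurement of 3, the figure used in the text; the function names are likewise illustrative.

```python
import math

def standard_error_of_measurement(sd_x, r11):
    """Formula (54): sigma_e = sigma_x * sqrt(1 - r11)."""
    return sd_x * math.sqrt(1 - r11)

def score_band(score, sem, k):
    """Limits score - k*sem and score + k*sem for the true score."""
    return score - k * sem, score + k * sem

def se_of_score_difference(sem):
    """Standard error of the difference between two obtained scores."""
    return sem * math.sqrt(2)

sem = standard_error_of_measurement(15, 0.96)   # hypothetical SD and reliability
print(round(sem, 2))
print(score_band(80, 3.0, 3))    # high confidence: 71 to 89
print(score_band(80, 3.0, 2))    # fair confidence: 74 to 86
print(round(se_of_score_difference(3.0), 2))
```

Two individuals tested with this hypothetical instrument would thus need to differ by well over 4 IQ points before the difference could be taken as more than chance.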

Determination of reliability. The above argument regarding 
the interpretation of the reliability coefficient either as an indicator 
of relative accuracy or in terms of $\sigma_e$ rests on the supposition 
that we have obtained the reliability coefficient as the result of 
correlating comparable measures of the same thing and that the 
variable errors are uncorrelated with themselves and with the true 
scores. The practical determination of the reliability coefficient 
involves more, therefore, than the mere correlating of two sets of 
measurements. The conditions under which the two sets of scores 
are obtained must be scrutinized for possible violation of the 
requisite assumptions. Some of the difficulties involved in ascertaining 
the reliability of a psychological measurement are suggested 
in the following paragraphs. 

First let us note that the chance variable error, e, can be broken 
up into many smaller components at least logically, although not 
necessarily experimentally. Thus we might set 

$$e = e_a + e_b + e_c + e_d + e_f + \cdots$$

in which $e_a$ = error in the instrument or test, 

$e_b$ = error due to extraneous physical disturbance, 

$e_c$ = error due to physiological condition of individual, 

$e_d$ = error in scoring or in reading instrument, 

$e_f$ = error due to day-to-day fluctuations. 

Other sources of variable error might be added, or some of those 
listed might be broken up into more minute parts. It is not 
assumed that these several sources contribute an equal amount to 
the variance of e, nor is it assumed that these several components 
are entirely independent of each other. For instance, daily fluctuations 
might be influenced by physiological condition. 

The assumption of uncorrelated errors implies that $e_1$ is not 
correlated with $e_2$. Of course the two scores for an individual 
might by chance contain a variable error of the same magnitude 
and sign; we are here interested, however, in whether an error 
which is chance for one score might tend in general to affect the 
second score in the same manner. For example, an upset stomach 
might lead to a reduced performance score, and if the second test 
was administered the same day, this same chance factor would 
affect the second performance score in the same direction. Thus 
in examining any proposed scheme for determining the reliability 
of a test we must inquire as to whether any of the sources of error 
can affect the two measurements on an individual in the same 
direction. If it seems reasonable to suspect that errors are correlated, 
it follows that the obtained reliability coefficient will be 
spuriously high since the presence of correlated errors will not 
allow formula (52) to be reduced to (52a). 

Let us consider a few of the “accepted” schemes for ascertaining 
reliability in order to see whether they are “acceptable” in light 
of the assumptions requisite to a sound reliability coefficient. 
These assumptions may be recapitulated in the form of three 
questions: Do the two tests or determinations represent measures 
of the same thing? Are the two series of measures comparable 
(comparable tests or instruments)? Is it possible or likely that 
the errors of measurement are correlated; i.e., can the error on 
the first test be correlated with the error on the second, or can 
the error on either be correlated with the true measure? 

For the ordinary mental, personality, or achievement test, 
reliability is usually ascertained by correlating supposedly equivalent 
(comparable?) forms, by correlating split halves (odd vs. even 
items or first half vs. second half of test), or by correlating test-retest 
scores. The test-retest method is of limited value in that 
there may be a memory carry-over from test to retest, in which 
case the retest will measure the same trait as the original test 
plus memory effects. In order to overcome this memory transfer, 
the retest may be administered some months after the first test, 
but this permits of a possible change in the trait or ability as a 
result of maturation or experience. 

Split-half reliability involves the correlating of two halves and 
applying the Brown-Spearman formula to determine the reliability 
of the whole test. (This formula gives the reliability of the 
whole test as twice the correlation between the halves divided by 
one plus the correlation between halves.) If the test items have 
been arranged according to difficulty, a first-half vs. second-half 
reliability will not satisfy the notion of comparable measures. 
Ordinarily the odd-even item technique will satisfy the criteria 
of comparability and sameness of trait. Neither of the split-half 
methods will satisfy the assumption of uncorrelated errors. Since 
both measures are determined at the same sitting, any chance 
fluctuations due to physiological conditions or to chance factors 
in the test situation will influence the two scores of an individual 
in the same direction. It is to be expected, therefore, that the 
correlation of halves will in general lead to a reliability coefficient 
which is too high, giving us an exaggerated notion of the accuracy 
with which we can place an individual on the trait continuum. 
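The Brown-Spearman step described in words above amounts to a single line of arithmetic. The sketch below uses an illustrative function name and a hypothetical half-test correlation of .60.

```python
def whole_test_reliability(r_halves):
    """Brown-Spearman formula for a test of double length:
    twice the correlation between halves, divided by one plus that correlation."""
    return 2 * r_halves / (1 + r_halves)

# A hypothetical odd-even correlation of .60 gives a whole-test value of .75.
print(round(whole_test_reliability(0.60), 2))
```

The stepped-up value inherits whatever inflation correlated errors have already introduced into the correlation between halves; the formula does nothing to remove it.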

By far the best method for determining the reliability of a test 
is to have two forms which have been made equivalent and comparable 
by careful selection and balancing of items. No item in 
one form should be so nearly identical with an item in the other 
form as to permit a direct memory transfer. Two forms, equivalent 
yet not identical, can be administered within, say, two weeks' 
time, a procedure which properly includes in the estimate of 
variable error the daily fluctuations due to either physiological 
or psychological conditions and variations due to chance factors 
in the physical situation in which the tests are given. With so 
short an interval between testings, the trait being measured will 
have changed only a negligible amount as a result of maturation 
or ordinary environmental influences. 

When we attempt to obtain the reliability of a learning score 
or of any performance which is influenced by practice, we encounter 
difficulties which are baffling to the researcher who rigorously 
adheres to the fundamental requisites of the reliability coefficient. 
The chief difficulty is the obvious fact that the “thing” being 
measured changes as a result of each measurement or trial. Test-retest, 
or first half vs. second half (of trials), or today's trials vs. 
tomorrow's will not represent measures of the same function, nor 
will any scheme analogous to equivalent forms avoid this difficulty, 
since “forms” which are comparable will permit transfer. 
The use of scores on odd vs. even trials will have the advantage 
of balancing somewhat the influence of practice, especially if 
several trials are given; but the possibility that a chance error 
affects odds and evens alike is present, in that a slip in the experimental 
procedure or a temporary discouragement on the part of 
the testee or the adoption by the subject of a poor approach to 
the problem will have a similar effect on both scores. If trials 
were spaced, say, a day apart, the factors just mentioned might 
not greatly disturb the reliability determination. In general, it 
can be said that the odd-even trial method will yield a reliability 
coefficient which is higher than the “true” reliability. 

The same shortcomings are present in the aforementioned 
methods when they are employed in determining the reliability 
of animal (or human) maze-learning scores. Other techniques, 
peculiar to the maze situation, have been proposed. Performances 
on the odd and even blinds, somewhat similar to odd and 
even items, have been correlated for the purpose of reliability, 
but since blinds differ considerably as regards difficulty, one cannot 
be sure that the two halves are comparable. One can also question 
the comparability of the first half and second half of the maze, 
since in general the last part tends to be learned more quickly 
than the first. Attempts to ascertain the reliability of one maze 
by correlating performance on it with that on another maze involve 
several difficulties. In the first place, there seems to be a general 
positive transfer (perhaps a general adaptation to the maze situation) 
from a first to a second maze; secondly, the second maze 
must be similar to the first in order to satisfy the requisite of 
comparable measures of the same ability, but if this similarity 
approaches identity the second maze becomes a retest; and thirdly, 
a close degree of similarity will lead to possible interference effects 
which may act differentially from animal to animal. 

The foregoing brief discussion of the requisites for, and difficulties 
in arriving at, a meaningful reliability coefficient should 
make obvious the necessity for examining critically any proposed 
method of determining the reliability of a psychological measurement. 
The interpretation of the reliability coefficient in terms 
of the standard error of measurement definitely assumes homoscedasticity, 
which is another way of saying that the reliability 
coefficient is valid only when the error of measurement is of the 
same order of magnitude for the entire range of scores. That this 
may not always hold true is evident from findings with the 1937 
Stanford Revision of the Binet Test. 

It should be noted in passing that the magnitude of the reliability 
coefficient is influenced by the trait homogeneity of the sample 
upon which it is based. Let $\sigma$ represent the standard deviation 
for the restricted range, S the standard deviation for the unrestricted 
range, $r_{11}$ the reliability for the restricted, and $R_{11}$ the 
reliability for the unrestricted; it may be assumed that 

$$\sigma^2(1 - r_{11}) = S^2(1 - R_{11}) \tag{55}$$

Thus, if we know $\sigma$, S, and either reliability coefficient, we can 
estimate the other. Relationship (55) assumes that the accuracy 
of measurement (absolute) is the same throughout the unrestricted 
range. 
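Relationship (55) can be solved for either reliability coefficient. The sketch below does so in both directions; the function names and the numerical values (standard deviations of 8 and 16, a restricted reliability of .80) are hypothetical.

```python
def unrestricted_reliability(r11, sd_restricted, sd_unrestricted):
    """Solve (55) for R11: R11 = 1 - sigma^2 (1 - r11) / S^2."""
    return 1 - sd_restricted ** 2 * (1 - r11) / sd_unrestricted ** 2

def restricted_reliability(R11, sd_restricted, sd_unrestricted):
    """Solve (55) for r11: r11 = 1 - S^2 (1 - R11) / sigma^2."""
    return 1 - sd_unrestricted ** 2 * (1 - R11) / sd_restricted ** 2

# Hypothetical figures: reliability .80 in a group with SD 8,
# estimated for a group with SD 16.
R11 = unrestricted_reliability(0.80, 8, 16)
print(round(R11, 3))
print(round(restricted_reliability(R11, 8, 16), 3))
```

Both sides of (55) are simply the squared standard error of measurement from formula (54), which is why the relationship holds only if that error is the same throughout the unrestricted range.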

Attenuation. Now we return to the question which led to this 
lengthy detour: How does unreliability affect the correlation between 
variables? Let 

$$x = x_\infty + e$$
$$y = y_\infty + d$$

where e and d represent the variable errors in the two scores, x 
and y. Then 

$$r_{xy} = \frac{\sum (x_\infty + e)(y_\infty + d)}{N\sigma_x\sigma_y} = \frac{\sum x_\infty y_\infty + \sum x_\infty d + \sum y_\infty e + \sum ed}{N\sigma_x\sigma_y}$$

If we assume that d is uncorrelated with $x_\infty$, that e is uncorrelated 
with $y_\infty$, and that e and d are uncorrelated, we have 

$$r_{xy} = \frac{\sum x_\infty y_\infty}{N\sigma_x\sigma_y} = \frac{r_{\infty\infty}\sigma_{x_\infty}\sigma_{y_\infty}}{\sigma_x\sigma_y} \qquad (r_{\infty\infty} = r \text{ between true scores})$$

Since $\sigma_\infty = \sigma_x\sqrt{r_{11}}$ by formula (52a), 

$$r_{xy} = r_{\infty\infty}\sqrt{r_{x_1x_2}}\sqrt{r_{y_1y_2}} \tag{56}$$

which, since the reliability coefficients $r_{x_1x_2}$ and $r_{y_1y_2}$ are less 
than unity, shows clearly that the correlation between obtained 
scores will be less than that between true scores; i.e., errors of 
measurement tend to reduce or attenuate the correlation between traits. 

One can rearrange formula (56) as 

$$r_{\infty\infty} = \frac{r_{xy}}{\sqrt{r_{x_1x_2}}\sqrt{r_{y_1y_2}}} \tag{57}$$
by which one can estimate what the correlation would be if perfect, 
errorless, measures were available. This is known as correction for 
attenuation. Correlation coefficients corrected for attenuation are 
of theoretical importance in the analysis of relationships in that 
allowance can be made for variable errors of measurement, but 
such corrected r's are of little practical value since they cannot 
be used in prediction equations. The prediction of one variable 
from another and the accompanying error of estimate must necessarily 
be based on obtained, or fallible, rather than true scores. 

Since the correlation between variables is a function of the 
reliability of their measurement, we may examine the limits imposed 
upon r as a result of fallible scores. By reference to formula 
(56), we observe that, if the correlation between true scores is 
unity and if the reliability for one variable is perfect, the obtained 
correlation between the two cannot exceed the square root of the 
reliability coefficient for the other variable. If the correlation 
between the true scores is perfect and if each variable is subject 
to errors of measurement, then the obtained correlation cannot 
exceed the product of the square roots of the two reliability coefficients. 
Obviously, if the reliabilities are the same, the obtained 
correlation cannot be greater than the reliability coefficient. 
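Both the correction for attenuation and the ceiling just described follow from the same product of square roots. A minimal sketch, with illustrative function names and hypothetical values for the obtained correlation and the two reliabilities:

```python
import math

def corrected_for_attenuation(r_xy, rel_x, rel_y):
    """Formula (57): obtained r divided by the square roots of the two reliabilities."""
    return r_xy / (math.sqrt(rel_x) * math.sqrt(rel_y))

def ceiling_on_obtained_r(rel_x, rel_y):
    """Formula (56) with the true-score correlation set at unity."""
    return math.sqrt(rel_x) * math.sqrt(rel_y)

# Hypothetical: obtained r of .42 with reliabilities of .70 and .80.
print(round(corrected_for_attenuation(0.42, 0.70, 0.80), 3))
print(round(ceiling_on_obtained_r(0.70, 0.80), 3))

# Equal reliabilities: the ceiling is the reliability coefficient itself.
print(round(ceiling_on_obtained_r(0.81, 0.81), 3))
```

An obtained r above the ceiling necessarily yields a corrected value above unity, which is one way the absurd results mentioned in the text can arise when the reliabilities are underestimated.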

In addition to the assumptions which were made specifically in 
deriving the formula for correcting for attenuation, it is also necessary 
to meet all the assumptions required for a sound reliability 
coefficient. Since obtained correlations and also reliability coefficients 
are functions of the homogeneity, with respect to the two 
traits, of the sample upon which they are based, it follows that 
the reliability coefficients used in correcting an obtained r should 
be based on the same sample as r or on a sample which is of comparable 
homogeneity. Corrected r's greatly in excess of unity 
have been reported. Such absurd results lead one to ask whether 
the assumptions have been met, but this question should be raised 
concerning any corrected r, even though it does not exceed unity, 
since the assumptions are difficult to meet. It has been said that 
a corrected r can legitimately exceed unity by as much as two or 
three times its sampling error. Formulas for the standard error 
of a corrected r are available, but nothing is known concerning the 
nature of the distribution of corrected r's for successive samples. 
Presumably this distribution would be markedly skewed for high 
values; hence the use of an ordinary standard error technique to 
determine whether a corrected r exceeds unity (or any other magnitude) 
by more than can reasonably be expected on the basis of 
sampling is an unsound procedure. 

INDEX CORRELATION 

A possible source of error in correlational work may be introduced 
when two indexes having a common variable denominator 
are correlated, such as X/Z and Y/Z. Before considering this 
special case, it might be well to turn our attention to more general 
formulas for indexes. These formulas involve the coefficient of 
variation, namely, $v = \sigma/M$, and their use leads to serious error 
when the v's are large, higher-power terms having been 
dropped in the derivations. 

Let $I = X_1/X_2$; then it can be shown that the mean and standard 
deviation of such an index or ratio will be approximately 

$$M_I = \frac{M_1}{M_2}\left(1 - r_{12}v_1v_2 + v_2^2\right) \tag{58}$$

$$\sigma_I = \frac{M_1}{M_2}\sqrt{v_1^2 + v_2^2 - 2r_{12}v_1v_2} \tag{59}$$



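If we have four variables, the following formula for the correlation 
of indexes will yield a good approximation: 

$$r_{\frac{X_1}{X_3}\frac{X_2}{X_4}} = \frac{r_{12}v_1v_2 - r_{14}v_1v_4 - r_{23}v_2v_3 + r_{34}v_3v_4}{\sqrt{v_1^2 + v_3^2 - 2r_{13}v_1v_3}\,\sqrt{v_2^2 + v_4^2 - 2r_{24}v_2v_4}} \tag{60}$$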


Although these formulas are very useful for determining means, 
sigmas, and the correlations for ratios in terms of means, sigmas, 
and correlation coefficients for the original variables, their use is 
somewhat limited in that generally one cannot know whether 
the index distribution is normal, nor can one make a statement 
concerning linearity and homoscedasticity for the correlation between 
two indexes. Such information, if needed, must be obtained 
by first determining the numerical value of the indexes for each 
individual and then making distributions. 

Several special cases can be deduced from formula (60). Thus 
the correlation between $X_1/X_3$ and $X_2$ is exactly equivalent to 
that between $X_1/X_3$ and $X_2/1$; i.e., $X_4$ is set equal to 1, which 
makes $v_4 = 0$, and therefore all terms in (60) involving the subscript 
4 vanish. The correlation between $X_1/X_3$ and the reciprocal 
of a variable would be obtained by setting $X_2 = 1$, i.e., letting 
$1/X_4$ be the reciprocal; then $v_2 = 0$, whence the desired formula 
can be obtained by dropping all terms involving $v_2$. Likewise the 
correlation can be deduced for $1/X_3$ with $1/X_4$, for $1/X_3$ with $X_2$, 
and for $X_1/X_3$ with $X_2/X_3$. This last correlation is of particular 
interest because it is possible to find a relationship between these 
two indexes even though the three original variables are uncorrelated. 



By substituting $X_3$ for $X_4$, i.e., replacing subscript 4 by 3, an 
expression for the correlation of indexes having a common variable 
denominator can readily be obtained. It will be 

$$r_{\frac{X_1}{X_3}\frac{X_2}{X_3}} = \frac{r_{12}v_1v_2 - r_{13}v_1v_3 - r_{23}v_2v_3 + v_3^2}{\sqrt{v_1^2 + v_3^2 - 2r_{13}v_1v_3}\,\sqrt{v_2^2 + v_3^2 - 2r_{23}v_2v_3}} \tag{61}$$

If $r_{12} = r_{13} = r_{23} = 0$, this becomes 

$$r_{\frac{X_1}{X_3}\frac{X_2}{X_3}} = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2}\,\sqrt{v_2^2 + v_3^2}}$$

and if the v's are equal, the value of the index correlation will be 
.50 even though there is no relationship between the original variables. 
This is known as spurious correlation due to indexes. There 
are instances, however, in which an analysis of the interrelations 
of ratios is of just as much import as the analysis of the variables 
from which the indexes are obtained, and therefore it does not 
follow that the correlation between ratios having a common denominator 
is necessarily misleading. 
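The spurious value of .50 can be verified directly from the approximation for indexes with a common denominator. In this sketch the function name is illustrative, formula (61) is taken in the form obtained from (60) by replacing subscript 4 with 3, and the coefficients of variation are hypothetical.

```python
import math

def common_denominator_index_r(r12, r13, r23, v1, v2, v3):
    """Formula (61): approximate correlation between X1/X3 and X2/X3."""
    numerator = r12 * v1 * v2 - r13 * v1 * v3 - r23 * v2 * v3 + v3 ** 2
    denominator = (math.sqrt(v1 ** 2 + v3 ** 2 - 2 * r13 * v1 * v3)
                   * math.sqrt(v2 ** 2 + v3 ** 2 - 2 * r23 * v2 * v3))
    return numerator / denominator

# Uncorrelated variables with equal coefficients of variation.
print(round(common_denominator_index_r(0, 0, 0, 0.10, 0.10, 0.10), 2))

# A denominator that varies little relative to the numerators adds little.
print(round(common_denominator_index_r(0, 0, 0, 0.20, 0.20, 0.05), 3))
```

The second line shows that the size of the spurious element depends on how variable the common denominator is relative to the numerators, not merely on its presence.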

It has been asserted that the correlation between IQ's derived 
from two tests or two forms of the same test will be spuriously 
high because of the common variable denominator, age. It can 
be shown, however, that such a correlation will not be spurious 
unless the two sets of IQ's are correlated with age. If the IQ-vs.-age 
correlations are both positive or both negative, the index 
correlation will be spuriously high; if one is negative and the other 
positive, spuriously low. Thus, rather than make a blanket statement 
to the effect that the correlation between IQ's is spuriously 
high, we should say that it can be spuriously high or low or not 
spurious at all, according to the IQ-vs.-age correlations. It should 
be remembered that, even though the IQ's based on an ideal 
(properly constructed and standardized) test will be uncorrelated 
with age, a nonzero relationship might be produced for a single 
school-grade group by the selective factors that operate in age-grade 
location. Within a single grade group in a school system 
where acceleration is permitted, the younger children are likely 
to be the brighter, i.e., have the higher IQ's, thus producing negative 
correlations for sets of IQ's with age, and consequently a 
spuriously high correlation between IQ's. 





PART-WHOLE CORRELATION 

Another type of spurious correlation arises when a total score is
correlated with a subscore which is a part of the total score.
Suppose that a total score is made up of three parts,
$X_t = X_1 + X_2 + X_3$, and that we correlate $X_1$ against $X_t$.
Ordinarily in such situations the components will themselves be
correlated positively. It should be obvious that the extent to which
$X_1$ correlates with $X_t$ is more or less dependent upon the fact that
$X_t$ includes $X_1$. It does not follow, however, that a high value for
$r_{1t}$ is not meaningful, even though spurious. For instance, a high
value for $r_{1t}$ would, regardless of spuriousness, justify the use of
$X_1$ in lieu of the battery of three subtests. There are times when
one may wish to know how highly a subtest correlates with a total,
based on any number of parts, minus the subtest. This correlation
is given by

\[
r_{1(t-1)} = \frac{r_{1t}\sigma_t - \sigma_1}{\sqrt{\sigma_t^2 + \sigma_1^2 - 2r_{1t}\sigma_1\sigma_t}}
\]
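
As a check, the following Python sketch compares the part-remainder formula with the correlation computed directly between a subtest and the total minus that subtest. The scores are invented for illustration, and the helper names are not from the text.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sd(x):
    """Standard deviation with N in the denominator."""
    n = len(x)
    m = sum(x) / n
    return math.sqrt(sum((a - m) ** 2 for a in x) / n)

def part_remainder_r(r1t, s1, st):
    """Correlation of subtest 1 with the total minus subtest 1."""
    return (r1t * st - s1) / math.sqrt(st ** 2 + s1 ** 2 - 2 * r1t * s1 * st)

# Hypothetical scores on three subtests for six individuals.
x1 = [3, 5, 2, 8, 7, 4]
x2 = [6, 4, 5, 9, 8, 3]
x3 = [2, 7, 4, 6, 9, 5]
total = [a + b + c for a, b, c in zip(x1, x2, x3)]
rest = [t - a for t, a in zip(total, x1)]

by_formula = part_remainder_r(pearson_r(x1, total), sd(x1), sd(total))
directly = pearson_r(x1, rest)
```

The two values agree apart from rounding, since the formula is an algebraic identity for sample moments.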


HETEROGENEITY WITH RESPECT TO A THIRD VARIABLE 

We have already discussed the influence on $r$ of heterogeneity
with regard to one or both of the variables being correlated. Suppose
variables $X_1$ and $X_2$ are two different traits, each of which is
related to age as the third variable. Then an older individual will
tend to be higher on both tests than a younger individual. In
other words heterogeneity with respect to age will tend to produce
correlation between $X_1$ and $X_2$, and our present problem is to
develop a method for correcting $r_{12}$ so that we can estimate what
the correlation between $X_1$ and $X_2$ would be if age were constant.

Suppose $r_{12}$, $r_{13}$, $r_{23}$, and the several means and standard
deviations are known; then let us visualize the three scatter diagrams.
The scatter for $r_{12}$ will be somewhat elongated as a result of the
influence of age, since variation in both $X_1$ and $X_2$ is here
supposed to be partly due to age variation. What is needed is the
correlation, between measures of $X_1$ and $X_2$, which has been freed
from the influence of age. If we were to express each $X_1$ in the
first array of the scatter for $r_{13}$ as a deviation from the mean of
this array and were to do the same for all other $X_1$'s in the scatter
(each as a deviation from the mean of the array in which it falls),
we would have scores expressed as deviations from the means of
the several ages. These deviations will be independent of age. As
an example, suppose an 8-year-old individual scores 28 and the
mean of 8-year-olds is 25, and a 14-year-old individual scores 54
and the mean of 14-year-olds is 51. The second individual scores
higher than the first because he is older, but each would have a
deviation (from his own age mean) of plus 3. Obviously, if we
also expressed the $X_2$ scores as deviations from the averages for
the several ages, they too would be independent of age influences.
Now, if we correlated these deviations-from-age means, we would
be correlating sets of $X_1$ and $X_2$ scores which would be free from
age, and hence we would arrive at a correlation, between variables
$X_1$ and $X_2$, which would not be affected by age heterogeneity.

Partial correlation. The task of determining the correlation
between two variables, with the influence of a third eliminated,
can always be accomplished by actually computing all the
deviations and then making a scatter diagram from which the $r$ can be
determined; but, in those cases in which we can assume linearity
of regression for $X_1$ on $X_3$ and $X_2$ on $X_3$, it is possible to set up a
method for determining the desired correlation from the three
correlation coefficients between the three variables. If linearity
exists, we can correlate the deviations from the two regression
lines instead of from the array means (or means for several ages
if age is the third variable). Since

\[
x'_1 = r_{13}\frac{\sigma_1}{\sigma_3}x_3 \quad \text{and} \quad x'_2 = r_{23}\frac{\sigma_2}{\sigma_3}x_3
\]

the two sets of deviation-from-regression scores will be

\[
x_1 - x'_1 = x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3 \quad \text{and} \quad x_2 - x'_2 = x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3
\]

The correlation of these deviation scores, which is designated by
the symbol $r_{12.3}$ (read: the correlation between $X_1$ and $X_2$ with
$X_3$ held constant) and known as the partial correlation coefficient,
becomes

\[
r_{12.3} = \frac{\sum (x_1 - x'_1)(x_2 - x'_2)}{N\sigma_{x_1 - x'_1}\sigma_{x_2 - x'_2}}
= \frac{\sum \left(x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3\right)\left(x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3\right)}{N\sigma_{x_1 - x'_1}\sigma_{x_2 - x'_2}}
\]

Multiplying and summing the numerator, and noting that the
$\sigma$'s in the denominator are nothing more than the errors of
estimate, $\sigma_{1.3}$ and $\sigma_{2.3}$, we have

\[
r_{12.3} = \frac{\sum x_1x_2 - r_{23}\frac{\sigma_2}{\sigma_3}\sum x_1x_3 - r_{13}\frac{\sigma_1}{\sigma_3}\sum x_2x_3 + r_{13}r_{23}\frac{\sigma_1\sigma_2}{\sigma_3^2}\sum x_3^2}{N\sigma_1\sqrt{1 - r_{13}^2}\,\sigma_2\sqrt{1 - r_{23}^2}}
\]

Dividing by $N$, cancelling $\sigma$'s, and collecting like terms, we get

\[
r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \tag{62}
\]
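
The equivalence between formula (62) and the correlation of deviations from the two regression lines can be verified numerically. The Python sketch below uses invented ages and test scores; the function names are illustrative and not part of the text.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, x):
    """Deviations of y from its least-squares regression line on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

def partial_r(r12, r13, r23):
    """Formula (62): correlation of variables 1 and 2 with 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical data: two tests, both of which rise with age.
age = [8, 9, 10, 11, 12, 13, 14]
test1 = [26, 31, 33, 40, 41, 47, 52]
test2 = [12, 15, 14, 19, 22, 21, 27]

r12 = pearson_r(test1, test2)
r13 = pearson_r(test1, age)
r23 = pearson_r(test2, age)

by_formula = partial_r(r12, r13, r23)
from_residuals = pearson_r(residuals(test1, age), residuals(test2, age))
```

Both routes give the same coefficient, and for these made-up data it is lower than the zero-order $r_{12}$, as expected when both tests rise with age.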


This formula definitely assumes the linearity of the two
regression lines for predicting $X_1$ and $X_2$ from $X_3$. Whether we
correlate deviations from array means or use formula (62), we end with
a correlation which has been freed of the influence of the third, or
eliminated, variable. If, for example, age is the third variable,
the partial correlation coefficient represents an estimate of what
the correlation would be if we held age constant by the use of
individuals of any one of the several age levels present in the
original group.

The difference between $r_{12.3}$ and $r_{12}$ indicates how much of the
correlation between variables 1 and 2 is due to the influence of
heterogeneity of a third variable. Obviously, if the third variable
is unrelated to $X_1$ and $X_2$, the partial $r$ will equal $r_{12}$, and if
either $r_{13}$ or $r_{23}$ (but not both) is negative and $r_{12}$ positive,
"partialing out" $X_3$ will raise the correlation. Is this reasonable?

The difficulties one encounters in determining the direction of
causation make it necessary to be careful in the use of the partial
correlation technique. If it is logical to consider holding a variable
constant experimentally, then it is reasonable to use the partial
correlation method when experimental control is not feasible.
The technique can be extended for "partialing out" or eliminating
more than one variable. Thus, to obtain an estimate of $r_{12}$ with
$X_3$ and $X_4$ held constant, we can use

\[
r_{12.34} = \frac{r_{12.4} - r_{13.4}r_{23.4}}{\sqrt{1 - r_{13.4}^2}\,\sqrt{1 - r_{23.4}^2}}
\]

which is in terms of first-order partials calculable by formula (62).
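
A second-order partial can be built up by applying formula (62) twice, as the following Python sketch suggests. The correlations are hypothetical, and the dictionary layout and function names are merely one convenient arrangement.

```python
import math

def first_order(rab, rac, rbc):
    """Formula (62): correlation of a and b with c held constant."""
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def second_order(r, a, b, c, d):
    """Correlation of a and b with c and d held constant.

    d is partialed out of every pair first; formula (62) is then applied
    again to the resulting first-order partials.
    """
    rab_d = first_order(r[a, b], r[a, d], r[b, d])
    rac_d = first_order(r[a, c], r[a, d], r[c, d])
    rbc_d = first_order(r[b, c], r[b, d], r[c, d])
    return first_order(rab_d, rac_d, rbc_d)

# Hypothetical intercorrelations among four variables.
pairs = {(1, 2): 0.60, (1, 3): 0.50, (1, 4): 0.40,
         (2, 3): 0.30, (2, 4): 0.35, (3, 4): 0.45}
r = {}
for (i, j), value in pairs.items():
    r[i, j] = value
    r[j, i] = value  # store both orders so either subscript order works

r12_34 = second_order(r, 1, 2, 3, 4)  # eliminate 4 first, then 3
r12_43 = second_order(r, 1, 2, 4, 3)  # eliminate 3 first, then 4
```

The order in which the two variables are eliminated does not affect the result.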





The sampling error of the partial coefficient may be handled by
the $z$ transformation. The standard error of the corresponding $z$
will be $1/\sqrt{N - 4}$ when only one variable has been eliminated,
and $1/\sqrt{N - 5}$ when two variables have been eliminated. For
$N$ less than 30, see Chapter 12.
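
The rule can be put in computational form as follows. This Python sketch transforms a partial $r$ to $z$ and attaches the standard error stated above; the interval it returns rests on the further assumption, not made explicitly in the text, that $z$ is approximately normal, with 1.96 taken as the usual 95 per cent multiplier. The numerical inputs are hypothetical.

```python
import math

def partial_r_interval(r_partial, n, n_eliminated, z_crit=1.96):
    """Approximate interval for a partial r by way of the z transformation.

    The standard error of z is 1/sqrt(n - 3 - n_eliminated), i.e.
    1/sqrt(n - 4) for one eliminated variable and 1/sqrt(n - 5) for two.
    """
    z = math.atanh(r_partial)              # r to z
    se = 1.0 / math.sqrt(n - 3 - n_eliminated)
    lower = math.tanh(z - z_crit * se)     # z back to r
    upper = math.tanh(z + z_crit * se)
    return lower, upper, se

# Hypothetical case: a first-order partial of .40 from 100 individuals.
lower, upper, se = partial_r_interval(0.40, 100, 1)
```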

A perplexing and often-recurring question with regard to the
interrelations of three variables is this: Are the correlations
consistent among themselves, or, if $r_{12}$ and $r_{13}$ are known, what
are the possible limits for $r_{23}$? If $r_{12}$ = unity and $r_{13}$ = unity,
$r_{23}$ must also equal unity, but, if $r_{12} = 0$ and $r_{13} = 0$, does it
follow that $r_{23} = 0$? It can be shown that the limits for the
correlation $r_{23}$ will always be

\[
r_{12}r_{13} \pm \sqrt{1 - r_{12}^2 - r_{13}^2 + r_{12}^2r_{13}^2}
\]
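
The limits are easily tabulated for any pair of values. The short Python sketch below (the function name is illustrative) also answers the question just raised: when $r_{12}$ and $r_{13}$ are both zero, $r_{23}$ may fall anywhere between -1.00 and +1.00.

```python
import math

def r23_limits(r12, r13):
    """Lower and upper limits for r23 consistent with given r12 and r13."""
    centre = r12 * r13
    half_width = math.sqrt(1 - r12 ** 2 - r13 ** 2 + r12 ** 2 * r13 ** 2)
    return centre - half_width, centre + half_width

both_zero = r23_limits(0.0, 0.0)    # no restriction on r23
both_unity = r23_limits(1.0, 1.0)   # r23 forced to unity
moderate = r23_limits(0.6, 0.6)     # roughly -.28 to +1.00
```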

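Examples:

When $r_{12}$ and $r_{13}$ each equal .90, the limits for $r_{23}$ are +.62 and +1.00;
when $r_{12}$ and $r_{13}$ each equal .50, the limits for $r_{23}$ are -.50 and +1.00;
when $r_{12}$ and $r_{13}$ each equal .25, the limits for $r_{23}$ are -.875 and +1.00.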

SUMMARY 

In this chapter, consideration has been given to factors which
have a bearing on the magnitude of the correlation coefficient. If
any of these is operative in the case of a particular coefficient, it
is the responsibility of the investigator to qualify his conclusions
accordingly. Published reports of correlational studies should
include:

a. A definition of the population being sampled and a statement
of the method used in drawing the sample.

b. The size of the sample and an adequate treatment of sampling
by means of nonantiquated formulas.

c. The means and particularly the standard deviations of the
variables being correlated, with some indication as to whether the
sample is typical as regards heterogeneity with respect to the
variables under consideration.

d. The reliability coefficients for the measures and the method
of determining reliability.

e. A statement relative to the homogeneity of the sample with
respect to possibly relevant variables such as age, sex, race.

f. A defense or precise interpretation of any reported
correlations involving indexes or of any part-whole correlations.





The researcher who is cognizant of the assumptions requisite
for a given interpretation of a correlation coefficient and who is
also fully aware of the many factors which may affect its
magnitude will not regard the correlational technique as an easy road
to scientific discovery.



CHAPTER 9 


Multiple Correlation 


So far our discussion of correlation has been concerned chiefly
with the prediction of one variable from another or the attributing
of a portion of the variance of one variable to the action of a second
variable. We shall next consider the case where it is desired to
predict one variable by using several other variables as a team of
predictors, or where, if causation can be assumed, an attempt is
made to analyze the variance for one variable into components or
parts attributable to the action of two or more other variables.
There is a close connection between the predicting and the
analyzing problems; let us first consider the method of predicting one
variable on the basis of other variables.

THE THREE-VARIABLE PROBLEM

For simplicity, consider the problem of predicting $X_1$ from a
knowledge of $X_2$ and $X_3$. The $X_1$ variable is frequently called the
criterion, or dependent variable. If we had $X_1$ to be predicted
from $X_2$ alone, we would have exactly the same situation as
predicting $Y$ from $X$. That is, the linear prediction equation (in
gross score form)

\[ Y' = BX + A \]

becomes

\[ X'_1 = BX_2 + A \]

and the deviation form

\[ y' = bx + a \]

becomes

\[ x'_1 = bx_2 + a \]

It will be recalled that the values of the constants, $B$ and $A$, or $b$
and $a$, were so determined as to give the maximum predictability,
and that $B$ and $A$ turned out to be functions of the correlation
coefficient between the two variables and of the means and
standard deviations for the variables. The equation which resulted from
giving $A$ and $B$ specific values was said to be the equation of the
best-fitting line: the error of prediction was minimized.

Now, if we wish to predict $X_1$ from $X_2$ and $X_3$, we start with an
equation of the form

\[ X'_1 = B_2X_2 + B_3X_3 + A \tag{63} \]

which can be written in deviation units as

\[ x'_1 = b_2x_2 + b_3x_3 + a \]

Either of these forms represents the equation of a plane. It can
be shown that $B_2 = b_2$ and $B_3 = b_3$. In fact, this is rather obvious
when we consider the meaning of these $B$ or $b$ coefficients. They
represent the slope of the plane; $B_2$ is the slope which the plane
makes with the $x_2$ axis, and $B_3$ the slope with regard to the $x_3$
axis. When we shift from raw to deviation scores, we are merely
shifting the origin, or point of reference, to the intersection of the
means, and this point in terms of deviation scores becomes zero.
This shift of the frame of reference does not change the position
or angle of the plane; hence $B_2 = b_2$ and $B_3 = b_3$. (The student
will recall that, for the ordinary two-variable problem, the slope
of the line was equal to $B$ or $b$.)

It remains to attach meaning to $A$ and $a$. In the equation
$Y' = BX + A$, it was noted that the constant $A$ was the $Y$
intercept, i.e., the value of $Y$ where the line cut the $y$ axis. It was
also found that $a = 0$; i.e., that in the deviation form the line cut
the $y$ axis at the origin. Perhaps the student has already anticipated,
by analogy, that the $A$ in our three-variable equation is the value
of $X_1$ where the plane cuts the $x_1$ axis, and that the value of $a$ will
become zero.

Before going farther, it might be well to take a look at the
problem geometrically. In the case of two variables, after plotting
the $X$ and $Y$ values in a scattergram, we can readily picture the
meaning of $B$ and $A$, and also obtain some notion of why certain
values of $B$ and $A$ will lead to better predictions than those
obtained by other values. In the case of three variables, $X_1$, $X_2$,
and $X_3$, we have a trio instead of a pair of measurements. In
order to draw up a plot of $N$ such sets of measurements, we will
need to use a three-dimensional scheme. Instead of placing a
tally mark in a cell defined by an interval along the $x$ axis and one
along the $y$ axis, we now have to consider a cell as defined by
intervals on the $x_1$, the $x_2$, and the $x_3$ axes. Instead of a square cell,
we have a cubical cell.

Suppose an individual's three scores fall in intervals $i_1$, $i_2$, and
$i_3$; then his "tally" will be placed in the cubicle formed at the
intersection of these three intervals. The total number of cubicles will
be the product of the number of intervals on each axis, and an
individual's location in the "box" will depend upon all three of
his scores. The student may be at a loss to know just how one
could make such a three-dimensional scattergram. Actually,
this diagram is not necessary, but it is of interest to imagine what
such a three-way distribution would look like. If the correlations,
$r_{12}$, $r_{13}$, and $r_{23}$, are fairly high (and positive), and if we think of
the frequencies in the several cubicles as being represented by
dots (or different degrees of density), then the swarm of dots will
extend from the lower left front to the upper right back of the box.
The greatest density will be at the center of this swarm, and the
density or frequency will fall off in all directions from the center.
The swarm will have the general shape and appearance of a
watermelon (ellipsoidal).

Imagine that a plane is to be cut through this swarm. Our job
is to so locate the plane that, when we start upward vertically
from any point on the bottom of the box, say the spot defined by
any pair of values for $X_2$ and $X_3$, we will find that the altitude,
i.e., the distance along the $x_1$ axis at which the plane is reached,
will constitute the best estimate of $X_1$ for individuals having any
given $X_2$ and $X_3$ scores. With a little reflection, the reader can
see that, of many ways of placing the plane, some positions will
obviously give very poor estimates, whereas others will lead to
better estimates. What we need is that plane which, for the given
$N$ sets of $X_1$, $X_2$, and $X_3$ scores, will yield the best possible
estimates.

The criterion of "best" is a least square affair: the sum of the
squares of the errors of estimate shall be a minimum. The task
is really that of determining the values of $A$, $B_2$, and $B_3$ in
formula (63) so that

\[ \sum (X_1 - X'_1)^2 \]

is a minimum. That is, we are to assign to $A$, $B_2$, and $B_3$ those
values which will permit the best possible estimate of an unknown
$X_1$ when we know the $X_2$ and $X_3$ values for the individual. The
principle to be used is exactly the same as that employed to obtain
the optimum value for $B$ and $A$ for the two-variable problem, but
the present problem is more complicated because we have to
determine the values for three constants.

Derivation of regression equations. Our task is simplified if
deviation scores are used, and we assume $a = 0$ (if we carried $a$
along, it would prove to be zero). It is simplified somewhat more
if we transform all three sets of scores into standard score form,
i.e., if we set $z = (X - M)/\sigma$. Then our equation becomes

\[ z'_1 = \beta_2z_2 + \beta_3z_3 \tag{64} \]

It should be noted that, since we are changing the size of our unit
of measure, it cannot be argued that $\beta_2$ will equal $B_2$ or $b_2$. The
task now is to determine the value of the beta coefficients, $\beta_2$ and
$\beta_3$, so as to have the best possible estimate of $z_1$, or so that the
average of the squared errors, or

\[ \frac{1}{N}\sum (z_1 - z'_1)^2 \]

shall be a minimum. Since $z_1 - z'_1 = z_1 - \beta_2z_2 - \beta_3z_3$, the
function, $f$, to be minimized is

\[ f = \frac{1}{N}\sum (z_1 - \beta_2z_2 - \beta_3z_3)^2 \]

To determine the values of $\beta_2$ and $\beta_3$ which will make this function
a minimum, use is made of the calculus. We take the partial
derivative of the function first with respect to $\beta_2$, then with respect
to $\beta_3$. Thus,

\[
\frac{\partial f}{\partial \beta_2} = \frac{-2\sum z_2(z_1 - \beta_2z_2 - \beta_3z_3)}{N}
\]

\[
\frac{\partial f}{\partial \beta_3} = \frac{-2\sum z_3(z_1 - \beta_2z_2 - \beta_3z_3)}{N}
\]

These two derivatives are to be set equal to zero and then solved
simultaneously for the two unknowns, $\beta_2$ and $\beta_3$. Performing the
indicated multiplications, summing, and dividing each equation
by 2, we get

\[
\begin{aligned}
-\frac{\sum z_1z_2}{N} + \beta_2\frac{\sum z_2^2}{N} + \beta_3\frac{\sum z_2z_3}{N} &= 0 \\
-\frac{\sum z_1z_3}{N} + \beta_2\frac{\sum z_2z_3}{N} + \beta_3\frac{\sum z_3^2}{N} &= 0
\end{aligned}
\]

Since we are dealing with standard scores, we can now capitalize
on certain properties thereof, namely, that the sum of their squares
divided by $N$ is unity, while any sum of cross products divided by
$N$ is the correlation between the two variables involved in the
cross products. Thus, we have

\[
\begin{aligned}
-r_{12} + \beta_2 + \beta_3r_{23} &= 0 \\
-r_{13} + \beta_2r_{23} + \beta_3 &= 0
\end{aligned}
\]

or

\[
\begin{aligned}
\beta_2 + r_{23}\beta_3 - r_{12} &= 0 \\
r_{23}\beta_2 + \beta_3 - r_{13} &= 0
\end{aligned} \tag{65}
\]


Since the $r$'s in the equations are determinable for any given
sample of data, they are in effect knowns, whereas the $\beta$'s are
unknowns. We therefore have two simultaneous equations with
two unknowns. These can readily be solved by a number of
methods which the student will find in an algebra textbook.
Straightforward solution gives

\[
\beta_2 = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}
\qquad
\beta_3 = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}
\]
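
That these two expressions do satisfy equations (65) can be confirmed by substituting them back. The Python sketch below does so for one set of hypothetical correlations; the function name is illustrative.

```python
def beta_weights(r12, r13, r23):
    """Solution of equations (65) for the two beta coefficients."""
    denominator = 1 - r23 ** 2
    beta2 = (r12 - r13 * r23) / denominator
    beta3 = (r13 - r12 * r23) / denominator
    return beta2, beta3

# Hypothetical correlations: criterion with each predictor, and the
# two predictors with each other.
r12, r13, r23 = 0.50, 0.40, 0.30
beta2, beta3 = beta_weights(r12, r13, r23)

# Left-hand sides of equations (65); both should vanish.
first_equation = beta2 + r23 * beta3 - r12
second_equation = r23 * beta2 + beta3 - r13
```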

As soon as we have computed the $r$'s, we can easily determine
the $\beta$'s. The obtained numerical values can then be substituted
in the prediction equation

\[ z'_1 = \beta_2z_2 + \beta_3z_3 \]

so that for a given pair of $z_2$ and $z_3$ values we can predict the
standard score on the criterion variable. However, in practice it is
ordinarily more convenient to deal with raw scores; hence we need
our prediction equation in raw score form. Obviously, if we
replace the $z$'s in the above equation by their values in terms of raw
scores, means, and standard deviations, we will have

\[
\frac{X'_1 - M_1}{\sigma_1} = \beta_2\frac{X_2 - M_2}{\sigma_2} + \beta_3\frac{X_3 - M_3}{\sigma_3}
\]

or

\[
\frac{X'_1}{\sigma_1} - \frac{M_1}{\sigma_1} = \beta_2\frac{X_2}{\sigma_2} - \beta_2\frac{M_2}{\sigma_2} + \beta_3\frac{X_3}{\sigma_3} - \beta_3\frac{M_3}{\sigma_3}
\]

Multiplying by $\sigma_1$ and rearranging terms, we have

\[
X'_1 = \beta_2\frac{\sigma_1}{\sigma_2}X_2 + \beta_3\frac{\sigma_1}{\sigma_3}X_3 + \left(M_1 - \beta_2\frac{\sigma_1}{\sigma_2}M_2 - \beta_3\frac{\sigma_1}{\sigma_3}M_3\right) \tag{66}
\]

from which we see that our original $B_2$ must equal $\beta_2(\sigma_1/\sigma_2)$,
$B_3 = \beta_3(\sigma_1/\sigma_3)$, and $A$ = the parentheses term. Thus we can
readily determine the numerical values of $B_2$, $B_3$, and $A$ and
thereby have the constants for the prediction equation. Actually,
the values of $B_2$ and $B_3$ are the optimum weights to be assigned
to $X_2$ and $X_3$ in order to predict $X_1$.
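
The passage from beta coefficients to raw-score constants in formula (66) may be illustrated as follows. The means, standard deviations, and beta values in this Python sketch are hypothetical, and the function name is not from the text.

```python
def raw_score_weights(beta2, beta3, means, sds):
    """Constants B2, B3, and A of formula (66) from the beta coefficients.

    means and sds are (M1, M2, M3) and (sigma1, sigma2, sigma3).
    """
    m1, m2, m3 = means
    s1, s2, s3 = sds
    b2 = beta2 * s1 / s2
    b3 = beta3 * s1 / s3
    a = m1 - b2 * m2 - b3 * m3
    return b2, b3, a

# Hypothetical values for illustration only.
means = (50.0, 100.0, 20.0)
sds = (10.0, 15.0, 4.0)
beta2, beta3 = 0.42, 0.27
B2, B3, A = raw_score_weights(beta2, beta3, means, sds)

# Predict X1 for an individual with X2 = 115 and X3 = 18, by both routes.
x2, x3 = 115.0, 18.0
raw_prediction = B2 * x2 + B3 * x3 + A
z_prediction = (beta2 * (x2 - means[1]) / sds[1]
                + beta3 * (x3 - means[2]) / sds[2])
via_standard_scores = means[0] + sds[0] * z_prediction
```

The raw-score equation and the standard-score equation give the same predicted $X_1$, as they must.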

Error of estimate. The accuracy of the prediction of $X_1$ by the
best combination of $X_2$ and $X_3$ can be ascertained by examining
the error term, i.e., $X_1 - X'_1$ or $\sigma_1(z_1 - z'_1)$. The sum of the
squares for the errors divided by $N$ will yield the variance of the
errors. The square root would correspond to the standard error
of estimate. Let $\sigma_{z_1.23}$ be this error (in sigma units), then

\[
\begin{aligned}
\sigma_{z_1.23}^2 &= \frac{\sum (z_1 - z'_1)^2}{N} = \frac{\sum (z_1 - \beta_2z_2 - \beta_3z_3)^2}{N} \\
&= \frac{\sum z_1^2}{N} + \beta_2^2\frac{\sum z_2^2}{N} + \beta_3^2\frac{\sum z_3^2}{N} - \frac{2\beta_2\sum z_1z_2}{N} - \frac{2\beta_3\sum z_1z_3}{N} + \frac{2\beta_2\beta_3\sum z_2z_3}{N} \\
&= 1 + \beta_2^2 + \beta_3^2 - 2\beta_2r_{12} - 2\beta_3r_{13} + 2\beta_2\beta_3r_{23}
\end{aligned}
\]

which by algebraic manipulation reduces to

\[ \sigma_{z_1.23}^2 = 1 - (\beta_2r_{12} + \beta_3r_{13}) \tag{67} \]

in terms of standard scores. Then $\sigma_1^2$ times this would give the
error variance for raw scores.

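Multiple r. We next define the multiple correlation coefficient
as the correlation between $z_1$ and the best estimate of $z_1$ from a
knowledge of $z_2$ and $z_3$. In symbols,

\[
r_{1.23} = r_{z_1z'_1} = \frac{\sum z_1z'_1}{N\sigma_{z_1}\sigma_{z'_1}} = \frac{\sum z_1(\beta_2z_2 + \beta_3z_3)}{N\sigma_{z_1}\sigma_{z'_1}} \tag{68}
\]

Note that, although $\sigma_{z_1} = 1$, it does not follow that $\sigma_{z'_1} = 1$. In
order to evaluate this last $\sigma$, we write

\[ z_1 = z'_1 + z_{1.23} \]

That is, we think of $z_1$ as being made up of two parts, that which
we can estimate plus a residual. It can easily be shown that these
two parts are independent of each other; hence by the variance
theorem we have

\[ \sigma_{z_1}^2 = \sigma_{z'_1}^2 + \sigma_{z_1.23}^2 \]

or

\[ 1 = \sigma_{z'_1}^2 + \sigma_{z_1.23}^2 \]

then

\[ \sigma_{z'_1}^2 = 1 - \sigma_{z_1.23}^2 \]

But $\sigma_{z_1.23}^2$ is nothing more than the variance of the prediction errors
as given by (67); therefore

\[ \sigma_{z'_1}^2 = \beta_2r_{12} + \beta_3r_{13} \]

Then, by substituting in formula (68), we have

\[
\begin{aligned}
r_{1.23} &= \frac{\sum z_1(\beta_2z_2 + \beta_3z_3)}{N\sqrt{\beta_2r_{12} + \beta_3r_{13}}} \\
&= \frac{\beta_2\sum z_1z_2 + \beta_3\sum z_1z_3}{N\sqrt{\beta_2r_{12} + \beta_3r_{13}}} = \frac{\beta_2r_{12} + \beta_3r_{13}}{\sqrt{\beta_2r_{12} + \beta_3r_{13}}} \\
&= \sqrt{\beta_2r_{12} + \beta_3r_{13}}
\end{aligned} \tag{69}
\]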
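
Formulas (67) and (69) can be evaluated together once the $\beta$'s are in hand. In the Python sketch below the correlations are hypothetical and the function name is illustrative; the expanded six-term expression for the error variance is computed alongside (67) to confirm that the two agree.

```python
import math

def multiple_r(r12, r13, r23):
    """Multiple correlation (69) and standard-score error variance (67)."""
    denominator = 1 - r23 ** 2
    beta2 = (r12 - r13 * r23) / denominator
    beta3 = (r13 - r12 * r23) / denominator
    predicted_variance = beta2 * r12 + beta3 * r13
    error_variance = 1 - predicted_variance          # formula (67)
    return math.sqrt(predicted_variance), error_variance, beta2, beta3

# Hypothetical correlations.
r12, r13, r23 = 0.50, 0.40, 0.30
R, error_variance, beta2, beta3 = multiple_r(r12, r13, r23)

# The unreduced expression that precedes formula (67).
expanded = (1 + beta2 ** 2 + beta3 ** 2
            - 2 * beta2 * r12 - 2 * beta3 * r13
            + 2 * beta2 * beta3 * r23)
```

For these values the multiple $r$ is about .56, somewhat above the larger of the two separate correlations with the criterion.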





We thus see that, as soon as the $\beta$'s are determined, we can write
the regression equation for predicting $z_1$ from $z_2$ and $z_3$ and can
also specify the degree of correlation and calculate the error of
estimate. This error obviously can be written from formulas (67)
and (69) as

\[ \sigma_{1.23}^2 = \sigma_1^2(1 - r_{1.23}^2) \tag{70} \]

which is in terms of raw scores.

Formula (70) has been used frequently to define the multiple
correlation coefficient. Stated explicitly,

\[
r_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} = 1 - \sigma_{z_1.23}^2
\]

Then, by substituting from (67), we again arrive at (69).

The student will note the similarity of formula (70) to the
ordinary error of estimate for the bivariate situation. Thus the
multiple correlation coefficient can be interpreted, in terms of reduction
in the error of estimate, in exactly the same manner as the ordinary
bivariate correlation coefficient. The only difference is that we
are now determining the regression coefficients, or weights for
two variables as a team, so as to get the best possible prediction of
a third variable, whereas in the bivariate situation only one
regression coefficient is necessary. A multiple correlation coefficient of
.60 has, aside from minor qualifications to be discussed later, the
same meaning in a predictive sense as an ordinary correlation of
.60. Furthermore, the interpretation in terms of contribution to
variance also holds for the multiple correlation coefficient; i.e., if
one can assume causation, it may be said that a multiple $r$ of .60
indicates that 36 per cent of the variance in the criterion or
dependent variable can be attributed to variation in the two
independent variables.

Relative weights. The question arises as to the relative
importance of the two variables as contributors to variation in the
criterion variable. The $B$ coefficients in the regression equation
have, at times, been misinterpreted as indicating the relative
contribution of the two independent variables. The reader need
only be reminded that the two $B$ coefficients usually involve
different units of measurement (one may be in terms of feet and the
other in pounds); hence they are not comparable at all. If $B_2$ is
numerically twice $B_3$, it does not follow that $X_2$ is twice as
important as $X_3$. In order to get around this difficulty, we must
think in terms of standard scores; these will be comparable, and
hence the $\beta$ coefficients in the standard score form of the regression
equation will be comparable.

Since

\[ \sigma_{z_1}^2 = \sigma_{z'_1}^2 + \sigma_{z_1.23}^2 \]

or

\[ 1 = \sigma_{z'_1}^2 + \sigma_{z_1.23}^2 \]

and

\[ 1 - \sigma_{z_1.23}^2 = r_{1.23}^2 \]

it follows that

\[ r_{1.23}^2 = \sigma_{z'_1}^2 \]

That is, $r_{1.23}^2$, which corresponds to the percentage of variance
explained, is equal to $\sigma_{z'_1}^2$, or the variance of the predicted
standard scores. This variance could be determined by actually making
$N$ predictions of $z_1$ from the $N$ pairs of values of $z_2$ and $z_3$ and then
computing the sigma for the distribution of these predicted values.
This is not done in practice, since the value of this sigma squared
is $r_{1.23}^2$, which is easily calculated once the $\beta$'s have been
determined.

But note that, since

\[ z'_1 = \beta_2z_2 + \beta_3z_3 \]

we can indicate the value of $\sigma_{z'_1}^2$ as

\[
\sigma_{z'_1}^2 = \frac{\sum z'^2_1}{N} = \frac{\sum (\beta_2z_2 + \beta_3z_3)^2}{N}
= \frac{\beta_2^2\sum z_2^2 + \beta_3^2\sum z_3^2 + 2\beta_2\beta_3\sum z_2z_3}{N}
\]

which becomes

\[ \sigma_{z'_1}^2 = \beta_2^2 + \beta_3^2 + 2\beta_2\beta_3r_{23} \tag{71} \]

In other words, the predicted variance, which corresponds to the
"explained" variance, can be broken down into three additive
components. We thus see that the relative importance of the
variables $X_2$ and $X_3$ in "explaining" or "causing" variation in
$X_1$ can be judged by the magnitude of the squares of the $\beta$
coefficients. The third term in formula (71) represents a joint
contribution which, it will be seen, is a function of the amount of
correlation between the two predicting variables.
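
The three components of formula (71) can be set out separately, as in the following Python sketch. The correlations are hypothetical and the dictionary keys are merely labels chosen for the illustration.

```python
def explained_variance_parts(r12, r13, r23):
    """Three additive components of the predicted variance, formula (71)."""
    denominator = 1 - r23 ** 2
    beta2 = (r12 - r13 * r23) / denominator
    beta3 = (r13 - r12 * r23) / denominator
    return {
        "beta2_squared": beta2 ** 2,          # contribution credited to X2
        "beta3_squared": beta3 ** 2,          # contribution credited to X3
        "joint": 2 * beta2 * beta3 * r23,     # joint contribution
        "r_squared": beta2 * r12 + beta3 * r13,
    }

# Hypothetical correlations.
parts = explained_variance_parts(0.50, 0.40, 0.30)
total = parts["beta2_squared"] + parts["beta3_squared"] + parts["joint"]
```

The three parts sum to the squared multiple correlation, and the joint term vanishes whenever the two predictors are uncorrelated.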





Summarizing, it can be said that the fundamental problem in
multiple correlation is that of obtaining the optimum weighting
to be assigned to independent variables ($X_2$ and $X_3$) in predicting
or explaining variation in a dependent variable, $X_1$. That is, we
determine the value of $B_2$, $B_3$, and $A$ in the equation

\[ X'_1 = B_2X_2 + B_3X_3 + A \]

so as to get the best possible estimate of $X_1$. This is resolved by
working with the prediction equation in standard score form with
$\beta$ coefficients. The value of each $\beta$ is determinable from the
intercorrelations among the three variables. Once the $\beta$'s are
calculated, we can: (1) readily compute the $B$ coefficients needed in the
raw score form of the prediction equation; (2) determine the value
of the multiple correlation coefficient and the error of estimate;
(3) ascertain the relative importance of the independent variables
as predictors or, if causation can be assumed, as contributors to
the variance of the dependent or criterion variable. It is important
to note that the multiple correlation coefficient represents the
maximum correlation to be expected between the dependent
variable and a linearly additive combination of $X_2$ and $X_3$.


MORE THAN THREE VARIABLES 

Suppose that we have a dependent variable and four independent
variables which might be used as predictors or which might be
thought of as causes of variation in the dependent variable. The
cause and effect, as opposed to concomitant, relationship among
variables is a logical problem which must be faced by the
investigator as a logician rather than as a statistician. Whether one
resorts to the multiple correlation technique as an aid in predicting
or as an aid in analysis will depend entirely upon the problem
being attacked; the mechanical solution is the same, but the
investigator must choose the interpretation which best suits his purpose.

For a five-variable problem, we need the constants in the
regression or prediction equation,

\[ X'_1 = B_2X_2 + B_3X_3 + B_4X_4 + B_5X_5 + A \]

which can be written in standard score form as

\[ z'_1 = \beta_2z_2 + \beta_3z_3 + \beta_4z_4 + \beta_5z_5 \]





As in the three-variable situation, the problem is that of
determining the optimum values of the $B$'s or the $\beta$'s so as to get the
best possible prediction of $X_1$ or $z_1$, i.e., so that

\[ \frac{\sum (X_1 - X'_1)^2}{N} \]

or

\[ \frac{\sum (z_1 - z'_1)^2}{N} \]

shall be as small as possible. The mathematical solution is easier
by way of the standard score form of the regression equation. We
have the function

\[
f = \frac{\sum (z_1 - z'_1)^2}{N} = \frac{\sum (z_1 - \beta_2z_2 - \beta_3z_3 - \beta_4z_4 - \beta_5z_5)^2}{N} \tag{72}
\]

which is to be minimized by assigning proper values to the $\beta$'s.
These values are obtained by taking the derivative of the function
with respect to, and in order for, each of the $\beta$'s. This will yield
four derivatives which when set equal to zero will give us four
equations involving the four unknown $\beta$'s. These equations can
then be solved as simultaneous equations in order to determine
the values of the $\beta$'s. The obtained $\beta$'s will be such that the sum
of the squares of $z_1 - z'_1$ will be the least possible; i.e., we will
have the best possible estimate of $z_1$ from an additive combination
of the four independent variables.

The student of the calculus can readily verify that the four
equations obtained by taking derivatives of formula (72) will take the
following form (when set equal to zero):

\[
\begin{aligned}
\beta_2 + \beta_3r_{23} + \beta_4r_{24} + \beta_5r_{25} - r_{12} &= 0 \\
\beta_2r_{23} + \beta_3 + \beta_4r_{34} + \beta_5r_{35} - r_{13} &= 0 \\
\beta_2r_{24} + \beta_3r_{34} + \beta_4 + \beta_5r_{45} - r_{14} &= 0 \\
\beta_2r_{25} + \beta_3r_{35} + \beta_4r_{45} + \beta_5 - r_{15} &= 0
\end{aligned} \tag{73}
\]

These equations result from steps exactly parallel to those used
for the three-variable problem. The four $\beta$'s are unknowns,
whereas, for any given batch of data, the $r$'s take on specific
numerical values.
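
For readers with a computer at hand, equations of this kind can be solved mechanically. The Python sketch below uses ordinary Gaussian elimination with partial pivoting, which is a generic routine and not the Doolittle procedure; the correlations are hypothetical and the function name is illustrative.

```python
def solve_betas(predictor_r, criterion_r):
    """Solve the simultaneous equations for the betas by elimination.

    predictor_r: square table of intercorrelations among the predictors.
    criterion_r: correlations of the criterion with each predictor.
    """
    n = len(criterion_r)
    # Augmented matrix: coefficients with the constants on the right.
    a = [list(row) + [rhs] for row, rhs in zip(predictor_r, criterion_r)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    betas = [0.0] * n
    for row in range(n - 1, -1, -1):          # back substitution
        tail = sum(a[row][k] * betas[k] for k in range(row + 1, n))
        betas[row] = (a[row][n] - tail) / a[row][row]
    return betas

# Hypothetical correlations for a five-variable problem (X2 to X5 predict X1).
predictor_r = [[1.00, 0.30, 0.20, 0.10],
               [0.30, 1.00, 0.25, 0.15],
               [0.20, 0.25, 1.00, 0.35],
               [0.10, 0.15, 0.35, 1.00]]
criterion_r = [0.50, 0.40, 0.30, 0.20]
betas = solve_betas(predictor_r, criterion_r)

# Left-hand side of each equation after substitution; all should vanish.
residuals = [sum(r * b for r, b in zip(row, betas)) - rhs
             for row, rhs in zip(predictor_r, criterion_r)]
```

The same routine applied to a two-predictor table reproduces the closed-form solution of the three-variable problem.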






The extension of multiple correlation to include any number of
variables involves the same principles as utilized here for the
three- and the five-variable problem. For $n$ variables (one
dependent and $n - 1$ independent), formula (64) becomes

\[ z'_1 = \beta_2z_2 + \beta_3z_3 + \cdots + \beta_nz_n \tag{64a} \]

The extension of (66) as the gross score equation should be obvious.
Formula (69) for the multiple correlation coefficient becomes

\[ r_{1.23 \ldots n} = \sqrt{\beta_2r_{12} + \beta_3r_{13} + \cdots + \beta_nr_{1n}} \tag{69a} \]

To solve for the unknown $\beta$'s, the student may resort to any of
the schemes given in algebra textbooks for solving simultaneous
equations. One method is by way of determinants and Cramer's
rule. The coefficients of the unknowns are the intercorrelations
among the four independent variables, whereas the constants in
these equations are the respective correlations of the dependent
with the independent variables. In the application of Cramer's
rule, these constants are thought of as being on the right-hand
side of the equation, i.e., shifted to the right of the equality mark,
with the consequent change of sign. The student should keep
in mind, however, the fact that the original sign of any of the
computed correlation coefficients must be considered.

Solution by Cramer's rule becomes quite tedious and
burdensome for a problem involving more than four or five variables.
Indeed, this determinantal solution is practically impossible for
problems involving a large number of variables. Fortunately,
there is available a simplified solution, but before turning to it,
we would like to indicate some algebraic manipulations in terms
of determinants.

It will be noted from the above simultaneous equations that all
the intercorrelations among the five variables are involved. One
can conveniently arrange these correlations in a table, or in
determinantal form. Thus we can define a major determinant as

\[
D = \begin{vmatrix}
1 & r_{12} & r_{13} & r_{14} & r_{15} \\
r_{12} & 1 & r_{23} & r_{24} & r_{25} \\
r_{13} & r_{23} & 1 & r_{34} & r_{35} \\
r_{14} & r_{24} & r_{34} & 1 & r_{45} \\
r_{15} & r_{25} & r_{35} & r_{45} & 1
\end{vmatrix}
\]





If we were to delete the first row and first column, the minor
which remains would involve the intercorrelations among the four
independent variables. This minor might be conveniently
symbolized as $D_{11}$; i.e., we have deleted the column and the row which
involves the subscript 1. If we were to delete the row which
involves the subscript 1 and the column involving the subscript 2
throughout, we would symbolize the resulting minor as $D_{12}$.

Now it can be shown that

$$\beta_2 = \frac{D_{12}}{D_{11}}$$

or any $\beta$, say $\beta_p$, will be

$$\beta_p = (-1)^p \frac{D_{1p}}{D_{11}}$$


where the quantity $(-1)^p$ is an indicator of either a positive or a
negative sign, but the ultimate sign of $\beta_p$ is also dependent upon
whether the numerical values of the determinants are positive or
negative. It can also be shown that the multiple correlation
coefficient can be written as a function of determinants, thus

$$r^2_{1.2345} = 1 - \frac{D}{D_{11}}$$



The student who is interested in following a treatment of multiple
correlation in terms of determinants is referred to T. L. Kelley's
Statistical method.*
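The determinantal route can be tried numerically. What follows is a minimal Python sketch, not anything given in the text: it borrows the intercorrelations of the chapter's numerical example (Table 20), places the criterion first as in $D$, and assumes the relations $\beta_p = (-1)^p D_{1p}/D_{11}$ and $r^2 = 1 - D/D_{11}$. The helper names `det` and `minor` are hypothetical.

```python
# Correlation table with the criterion X1 in the first row and column,
# followed by X2, X3, X4, X5 (values from the chapter's numerical example).
R = [
    [1.00, 0.55, 0.53, 0.52, 0.64],
    [0.55, 1.00, 0.56, 0.49, 0.42],
    [0.53, 0.56, 1.00, 0.63, 0.46],
    [0.52, 0.49, 0.63, 1.00, 0.39],
    [0.64, 0.42, 0.46, 0.39, 1.00],
]


def det(m):
    """Determinant by expansion along the first row (adequate for small tables)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(sub)
    return total


def minor(m, i, j):
    """Determinant left after deleting the row of variable i and the column of variable j (1-based)."""
    sub = [row[:j - 1] + row[j:] for k, row in enumerate(m, start=1) if k != i]
    return det(sub)


D = det(R)
D11 = minor(R, 1, 1)

# beta_p = (-1)^p * D_1p / D_11 for p = 2, ..., 5
betas = {p: (-1) ** p * minor(R, 1, p) / D11 for p in range(2, 6)}

# Multiple correlation from the ratio of the major determinant to its first minor
r_multiple = (1 - D / D11) ** 0.5

for p, b in betas.items():
    print(f"beta_{p} = {b:.3f}")
print(f"multiple r = {r_multiple:.3f}")
```

The expansion by first row grows very quickly with the number of variables, which is one way of seeing why the determinantal solution is impractical for large problems.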


NUMERICAL SOLUTION 

The solution of the simultaneous equations for the unknown
$\beta$'s can best be accomplished by resort to Doolittle's method. This
method is applicable to the solution of any simultaneous equations
involving a major determinant which, like $D$, is symmetrical about
the diagonal. It is also applicable to problems involving less or
more than five variables. The first step is to write down the
intercorrelations (coefficients of the unknown $\beta$'s) in the form indicated
in Table 18, in which the right-hand column contains the
correlation of each variable with the criterion or dependent variable.
Negative signs are attached to these coefficients because, in essence,
we are dealing with equations (73). Obviously, if the original
sign of an r were negative, it would be preceded by a plus sign in
an arrangement like that in Table 18.

* Kelley, T. L., Statistical method, New York: Macmillan, 1924.


Table 18. Schema for Arranging r's for Doolittle Solution

    X2      X3      X4      X5      X1
    1       r23     r24     r25     -r12
            1       r34     r35     -r13
                    1       r45     -r14
                            1       -r15


As a numerical example, we shall use data from the Minnesota
study of mechanical ability.† The sample size is 100.

Let $X_1$ = Criterion (mechanical performance-quality).
    $X_2$ = Minnesota assembling test.
    $X_3$ = Minnesota spatial relations test.
    $X_4$ = Paper form board.
    $X_5$ = Interest analysis blank.

Since the several means and standard deviations will be needed,
these are recorded in Table 19.


Table 19. Means and SD's (Minnesota Data)

             X1       X2        X3       X4       X5
    M      14.94   127.56   1422.90    46.60   107.00
    σ       2.09    25.32    296.39    19.45    18.00


In Table 20 will be found the Doolittle solution for the $\beta$
coefficients. Once these are known, the regression equation, in raw
score form, can be written, and the multiple r and the error of
estimate can be determined. The table includes an indication of the
calculation of these values. The student will have to study
carefully the schema of the Doolittle solution in order to grasp the
necessary steps. We shall not attempt a complete exposition of
the steps since the procedure of each step is indicated in the
left-hand side of the table. A few remarks, however, will be of aid to
the student.

† Paterson, D. G., et al., Minnesota mechanical ability tests, Minneapolis:
University of Minnesota Press, 1930.





Table 20. Computation of Multiple r

                                        X2      X3      X4      X5      X1   Check
(a)                                   1.00     .56     .49     .42    -.55    1.92
(b)                                           1.00     .63     .46    -.53    2.12
(c)                                                   1.00     .39    -.52    1.99
(d)                                                           1.00    -.64    1.63

(1): line (a)                         1.00     .56     .49     .42    -.55    1.92
(2)                                  -1.00    -.56    -.49    -.42     .55   -1.92

(3): line (b)                                1.000     .63     .46    -.53    2.12
(4): (1)(-.56)                               -.314   -.274   -.235    .308  -1.075
(5): (3) + (4)                                .686    .356    .225   -.222   1.045 ck
(6): (5)(-1/.686)                           -1.000   -.519   -.328    .324  -1.524 ck

(7): line (c)                                        1.000     .39    -.52    1.99
(8): (1)(-.49)                                       -.240   -.206    .270   -.941
(9): (5)(-.519)                                      -.185   -.117    .115   -.542
(10): (7) + (8) + (9)                                 .575    .067   -.135    .507 ck
(11): (10)(-1/.575)                                 -1.000   -.116    .235   -.882 ck

(12): line (d)                                               1.000    -.64    1.63
(13): (1)(-.42)                                              -.176    .231   -.806
(14): (5)(-.328)                                             -.074    .073   -.343
(15): (10)(-.116)                                            -.008    .016   -.059
(16): (12) + (13) + (14) + (15)                               .742   -.320    .422 ck
(17): (16)(-1/.742)                                         -1.000    .431   -.569 ck

Back solution

From (17)                                                       .431 = $\beta_5$
From (11)                             (.431)(-.116) + .235 = $\beta_4$ = .185
From (6)              (.185)(-.519) + (.431)(-.328) + .324 = $\beta_3$ = .087
From (2)   (.087)(-.56) + (.185)(-.49) + (.431)(-.42) + .55 = $\beta_2$ = .230

Final checks

(.230)(1.00) + (.087)( .56) + (.185)( .49) + (.431)( .42) - .55 = .000
(.230)( .56) + (.087)(1.00) + (.185)( .63) + (.431)( .46) - .53 = .001
(.230)( .49) + (.087)( .63) + (.185)(1.00) + (.431)( .39) - .52 = .001
(.230)( .42) + (.087)( .46) + (.185)( .39) + (.431)(1.00) - .64 = .000

From formula (66)

$B_2 = (.230)\dfrac{2.09}{25.32} = .0190$,   $B_3 = (.087)\dfrac{2.09}{296.39} = .0006$

$B_4 = (.185)\dfrac{2.09}{19.45} = .0199$,   $B_5 = (.431)\dfrac{2.09}{18.00} = .0500$,   $A = 5.40$

Then

$X'_1 = .0190X_2 + .0006X_3 + .0199X_4 + .0500X_5 + 5.40$

$r^2_{1.2345} = (.230)(.55) + (.087)(.53) + (.185)(.52) + (.431)(.64) = .54465$

$r_{1.2345} = .738$,   $\sigma_{1.2345} = 2.09\sqrt{1 - (.738)^2} = 1.40$









As already specified, the correlations are written down in an
order corresponding to equations (73) except that values to the
left and below the diagonal are omitted. The first thing we do is
to set up a check column. The first entry, 1.92, is obtained by
summing, algebraically, the first row of correlations (including
the diagonal 1.00); the second figure, 2.12, is the sum of the second
row plus .56; the third entry, 1.99, is the sum of the third row
plus .49 and .63; and the 1.63 is the sum of the fourth row plus
.42, .46, and .39. The rule being followed should now be obvious:
the jth entry in the check column is obtained by summing the
1.00 in the jth row with the values above it and to its right. The
student should satisfy himself that this is equivalent to summing
the correlations for the respective equations in (73). Since the
check column will provide, at intervals, an automatic check on
our computations, this summing should be done at least twice to
insure accuracy.
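The rule for the check column can be written out in a few lines. This is only an illustrative Python sketch; the list name `upper` and the function `check_column` are hypothetical, and a 0.00 stands for a cell that the schema leaves blank.

```python
# Rows (a) to (d) of Table 20: predictor intercorrelations on and above the
# diagonal, followed by minus the correlation with the criterion.
upper = [
    [1.00, 0.56, 0.49, 0.42, -0.55],  # line (a)
    [0.00, 1.00, 0.63, 0.46, -0.53],  # line (b)
    [0.00, 0.00, 1.00, 0.39, -0.52],  # line (c)
    [0.00, 0.00, 0.00, 1.00, -0.64],  # line (d)
]


def check_column(rows):
    """For row j: the diagonal 1.00 plus the values to its right and above it."""
    checks = []
    for j, row in enumerate(rows):
        to_the_right = sum(row[j:])                    # diagonal and everything right of it
        above = sum(rows[i][j] for i in range(j))      # entries above the diagonal in column j
        checks.append(round(to_the_right + above, 2))
    return checks


checks = check_column(upper)
print(checks)
```

Because the omitted cells mirror the ones above the diagonal, each sum equals the algebraic sum of the full equation for that row.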

Line (1) of the solution is obtained by copying down line (a),
the first row of r's; and line (2) consists of the line (1) values with
the signs changed. The second part of the solution begins with
line (3), which is obtained by copying down the (b) row of
correlations. Line (4) is obtained by multiplying entries in line (1) by
-.56, which figure is found in line (2) directly above the 1.000 of
line (3). As indicated at the left, line (5) results from summing
lines (3) and (4), i.e., 1.000 + (-.314) equals .686, etc.

At this point we have our first automatic check: summing line
(5) across should yield 1.045, already obtained by vertical summing
of values in the check column. To be a satisfactory check, these
two sums should agree within limits consistent with errors
imposed by rounding off to three decimal places. Acceptable
discrepancies will be of the order ±.001, ±.002, ... ±.005, seldom
larger.

Line (6) is obtained by multiplying line (5) by the negative
reciprocal of its first entry. The correctness of the reciprocal used
is evidenced by the fact that, when multiplied by .686, unity
results. The ck attached to -1.524 indicates that summing the
entries in line (6) yields the same value as 1.045 multiplied by the
negative reciprocal of .686, thus providing a further check. This
completes the second part of the solution.

The third part begins with a copying of row (c) of the correlation
table. The student should now be able to follow the steps; in
particular, he should note that a multiplier is secured from the
last line of each preceding part of the solution; that each multiplier
is applied in turn to the values in the line just above it; that, when
all such multipliers have been utilized, the lines are summed
(summing across again provides a check), and the resulting line
is, as before, multiplied by the negative reciprocal of its first entry,
thus completing the third part of the solution.

The fourth part involves similar operations. If we had five
independent variables, we would proceed in like fashion, with an
additional or fifth part. The schema can be extended to any
number of variables. There will be as many parts to the solution
as there are independent variables. The last part always consists
of three columns of figures, and the bottom figure in the middle
column is the value for $\beta_n$. In our example $\beta_n = \beta_5 = .431$.

The other $\beta$'s are determined by a “back” solution, which always
involves a substitution of the value or values already found into
the last line of the various parts (lines 11, 6, and 2 in our
illustration). This back solution is given in Table 20. As a final check
on all the computations, the four $\beta$'s obtained must be substituted
into the four simultaneous equations with which we began. This
check appears next in Table 20.
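The forward and back solutions can be sketched in code. The Python below is an illustration under stated assumptions rather than the text's own procedure: it carries full precision instead of three decimals, it omits the check column, and the names `doolittle`, `sums`, and `scaled` are hypothetical. The list `sums` plays the part of lines (1), (5), (10), (16); `scaled` plays the part of lines (2), (6), (11), (17).

```python
# Rows (a) to (d): zeros for the omitted cells, predictor intercorrelations,
# then minus the correlation of each predictor with the criterion.
upper = [
    [1.00, 0.56, 0.49, 0.42, -0.55],
    [0.00, 1.00, 0.63, 0.46, -0.53],
    [0.00, 0.00, 1.00, 0.39, -0.52],
    [0.00, 0.00, 0.00, 1.00, -0.64],
]


def doolittle(rows):
    """Return the beta weights for a symmetric system laid out as in Table 18."""
    k = len(rows)                      # number of independent variables
    sums, scaled = [], []
    for j in range(k):
        line = list(rows[j])           # copy row j, e.g. line (3) or line (7)
        # Apply one multiplier from the last line of each preceding part.
        for s, t in zip(sums, scaled):
            mult = t[j]
            for c in range(j, k + 1):
                line[c] += mult * s[c]
        sums.append(line)
        # Multiply by the negative reciprocal of the leading entry.
        scaled.append([-x / line[j] for x in line])
    # Back solution: start from the last part and substitute upward.
    betas = [0.0] * k
    for j in reversed(range(k)):
        betas[j] = scaled[j][k] + sum(
            scaled[j][c] * betas[c] for c in range(j + 1, k)
        )
    return betas


betas = doolittle(upper)               # beta_2, beta_3, beta_4, beta_5
print([round(b, 3) for b in betas])
```

Small differences in the third decimal between such a computation and the hand solution are to be expected, since the table rounds every intermediate line to three places.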

In order to put our results into useful form, we ordinarily require
the multiple regression equation in raw score form, and for this
we need the B coefficients and A as called for in formula (66)
extended for more variables. To get the multiple correlation
coefficient, the $\beta$'s and appropriate r's are substituted in formula (69a),
and from (70) we obtain the standard error appropriate for judging
the accuracy of predictions made by the calculated regression
equation. Table 20 includes these additional values.
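These last steps are simple arithmetic and can be sketched as follows. The Python is illustrative only, and it assumes that formula (66) takes the usual form $B_p = \beta_p \sigma_1/\sigma_p$ with $A = M_1 - \sum B_p M_p$, and that formula (70) is $\sigma_1\sqrt{1 - r^2}$; neither formula is reproduced in this section. The variable names are hypothetical.

```python
from math import sqrt

betas = [0.230, 0.087, 0.185, 0.431]            # beta_2 ... beta_5 from Table 20
r_crit = [0.55, 0.53, 0.52, 0.64]               # r_12, r_13, r_14, r_15
means = [14.94, 127.56, 1422.90, 46.60, 107.00]  # M for X1 ... X5 (Table 19)
sds = [2.09, 25.32, 296.39, 19.45, 18.00]        # sigma for X1 ... X5 (Table 19)

# Raw-score weights: each beta is rescaled by sigma_1 / sigma_p.
B = [b * sds[0] / s for b, s in zip(betas, sds[1:])]

# Constant term so that the predicted mean equals the criterion mean.
A = means[0] - sum(b * m for b, m in zip(B, means[1:]))

# Formula (69a): sum of beta times the corresponding criterion correlation.
r_squared = sum(b * r for b, r in zip(betas, r_crit))
r_multiple = sqrt(r_squared)

# Standard error of estimate for predictions of X1.
se_estimate = sds[0] * sqrt(1 - r_squared)

print([round(b, 4) for b in B], round(A, 2))
print(round(r_squared, 5), round(r_multiple, 3), round(se_estimate, 2))
```

The constant A is sensitive to how far the B's are rounded before use (the weight for $X_3$ multiplies a mean of over 1400), so a machine result need not agree with a hand result in the second decimal.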

If the problem involves analysis rather than prediction, one need
not set up the regression equation or calculate the error of estimate.
Appropriate interpretations would depend upon the $\beta$'s and $r_{1.2345}$
(see discussion, pp. 151-152).


SAMPLING ERRORS 

The classical formula for the standard error of a multiple
correlation involving n variables is

$$\sigma_{r_{1.23\cdots n}} = \frac{1 - r^2_{1.23\cdots n}}{\sqrt{N - n}}$$





If N is very large, say over 500, and if the value of $r_{1.23\cdots n}$ is
not too high, this formula will provide a satisfactory
approximation. But when N is small and the number of variables, n, is
large relative to the size of the sample, the above formula yields
an underestimate of the error. The significance of the multiple
correlation coefficient can best be ascertained by the analysis of
variance technique, to be discussed in a later chapter.

Closely related to sampling is the shrinkage of the multiple
correlation coefficient. This may be best understood by taking an
extreme case. For the ordinary bivariate correlation, it is evident
on a moment's reflection that, if N = 2, the correlation between
the two variables must be perfect positive or perfect negative (it
would be indeterminate if for either variable the two scores were
the same); the regression line will pass through both plotted points
on the scatter diagram. That is, in so far as prediction is
concerned, there would be no error. In the case of three variables
and N = 3, it would be possible to pass a plane through all three
plotted points. In general, if n = N, we would get a perfect
multiple r. Obviously N must be greater than n before any
meaning can be attached to a multiple r. As n approaches N, the value
of multiple r always approaches unity.

This suggests that, when n is large relative to N, the real
significance of an obtained multiple r is questionable. In other words,
the multiple correlation coefficient is subject to a positive bias,
the magnitude of which depends upon the degree to which n
approaches N. An unbiased estimate, r', of the universe value of
$r_{1.23\cdots n}$ can be obtained from

$$r'^2 = 1 - (1 - r^2_{1.23\cdots n})\frac{N - 1}{N - n} \qquad (75)$$

This is sometimes known as a correction for shrinkage, since it
has been observed that in general the correlation between observed
and predicted values for a new sample tends to be less than the
multiple r obtained by means of the $\beta$'s computed from the original
sample. Obviously, if N is very large, say 500, and n small, say
10, the amount of bias or expected shrinkage is so small as to be
negligible.
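The size of the correction is easily tabulated. The Python sketch below assumes that formula (75) has the form usual for this correction, $r'^2 = 1 - (1 - r^2)(N - 1)/(N - n)$, with n counting all variables including the criterion; the function name `shrunken_r` is hypothetical, and a negative corrected square is simply reported as zero.

```python
from math import sqrt


def shrunken_r(r, N, n):
    """Corrected multiple r for N cases and n variables (criterion included)."""
    r_sq = 1 - (1 - r * r) * (N - 1) / (N - n)
    return sqrt(r_sq) if r_sq > 0 else 0.0


# The Minnesota example: five variables, 100 cases.
minnesota = shrunken_r(0.738, 100, 5)

# A multiple of .44 from eleven variables and 89 cases.
small_sample = shrunken_r(0.44, 89, 11)

# A large sample with few variables: the correction barely matters.
large_sample = shrunken_r(0.44, 500, 10)

print(round(minnesota, 3), round(small_sample, 3), round(large_sample, 3))
```

The correction allows only for the number of variables actually entered in the equation; it makes no allowance for any prior selection of those variables from a larger pool.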

CAUTIONS AND REMARKS 

As already indicated, there are two principal uses for the
multiple correlation technique: (1) it yields the optimum weighting for
combining a series of variables in predicting a criterion and
provides an indication of the accuracy of subsequent predictions;
(2) it permits the analyzing of variation into component parts.
There are certain more or less obvious pits into which the unwary
user of the multiple regression and correlation method may fall.
For example, it is possible to write a multiple regression equation
for predicting school achievement ($X_1$) from a knowledge of age
($X_2$) and mental age ($X_3$). In standard score form it might be
$z'_1 = .27z_2 + .67z_3$, from which one might infer that school
achievement depends upon age to a certain extent but upon mental
age to a greater extent. However, it is entirely possible to argue
that mental age depends partly upon school achievement. One
could also use the same data to write the regression for age on
mental age and school achievement; thus $z'_2 = .56z_1 + .06z_3$,
from which the unwary might conclude that age depends upon
school achievement and mental age.

Multiple correlation may be particularly deceptive when one has
available several variables, each of which yields a rather low
correlation with the criterion and from which those yielding the higher
correlations with the criterion are selected for the prediction
equation. Such selecting tends to capitalize on correlations which
might be high because of sampling fluctuations. For example,
the author was once requested to compute the multiple r for an
11-variable problem. None of the 10 variables showed a very high
correlation with the criterion, the highest being .27. The resulting
multiple was .44, which was statistically significant for the sample
of 89 cases. When it was learned that 10 variables out of 40 had
been selected as the most promising, i.e., because they showed the
highest correlations with the criterion, the real significance of the
multiple r of .44 was questioned. That it really was misleading
was clearly evidenced by the fact that for a second and similar
sample the variable originally yielding the highest r (.27) now
produced an r of -.11. That is, the supposedly best single
predictor was actually of very doubtful value, and this, coupled with
a tendency for the next highest r's to drop appreciably, meant that
predictions by the regression equation could not be as good as was
inferred from the multiple of .44.

Nothing has been said as yet concerning the principal
assumption and consequent limitation in the use of multiple regression
equations, namely, that regressions for the first-order correlations
must be linear. There are methods for handling multiple
correlation for curvilinear regressions. The reader is referred to
M. Ezekiel's Methods of correlation analysis.†

It is not obvious from our discussion that, in general, the increase 
in the multiple correlation which results from adding variables 
beyond the first five or six is very small. This phenomenon of 
diminishing returns would not, of course, operate if we were to find 
an additional variable which correlated much more highly with 
the criterion than any of those already utilized. 

Another fact which may not be apparent to the reader is that we
can expect the multiple r to be higher when the intercorrelations
among the predictors are low instead of high. This point can be
easily demonstrated to one's own satisfaction by computing the
multiples for, say, $r_{12} = .50$, $r_{13} = .50$, and varying values for
$r_{23}$.
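The suggested demonstration takes only a few lines. The Python below is a sketch that uses the familiar three-variable expression $r^2_{1.23} = (r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23})/(1 - r_{23}^2)$; the function name `multiple_r` and the particular values tried for $r_{23}$ are illustrative choices, not taken from the text.

```python
from math import sqrt


def multiple_r(r12, r13, r23):
    """Multiple correlation of X1 with X2 and X3 from the three intercorrelations."""
    r_sq = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return sqrt(r_sq)


# Hold both criterion correlations at .50 and let the predictors overlap more and more.
for r23 in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"r23 = {r23:.1f}   multiple r = {multiple_r(0.5, 0.5, r23):.3f}")
```

With both criterion correlations fixed at .50, the expression reduces to $.5/(1 + r_{23})$ for the squared multiple, so the multiple falls steadily as $r_{23}$ rises.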

An interesting paradox of multiple correlation and an exception
to the fact mentioned in the previous paragraph is that it is possible
to increase prediction by utilizing a variable which shows no, or
low, correlation with the criterion, provided it correlates well with
a variable which does correlate with the criterion. Thus, if
$r_{12} = .400$, $r_{13} = .000$, and $r_{23} = .707$, the regression equation
will be $z'_1 = .800z_2 - .566z_3$, and $r_{1.23}$ will equal .566. It is thus
seen that, when $z_3$ is combined with $z_2$, an appreciable gain in
prediction occurs even though when taken alone $z_3$ is worthless as a
predictor of $z_1$.

Such a variable has been termed a “suppressant.” One does
not quickly see just how a suppressant variable, showing no
correlation with the criterion, can increase the accuracy of prediction.
Perhaps this point can be explained by reasoning by way of the
notion that correlation can be thought of in terms of common
elements (pp. 117-118). Suppose that $X_1$ is composed of 10 elements,
$X_2$ of 10, $X_3$ of 5, and suppose that $X_1$ and $X_2$ have 4 elements in
common, $X_2$ and $X_3$ have 5 elements in common, and $X_1$ and $X_3$
have no overlapping elements. Diagrammatically, the variables
and elements would be

    X1:  a a a a a a b b b b
    X2:              b b b b c d d d d d
    X3:                        d d d d d

† Ezekiel, M., Methods of correlation analysis, New York: John Wiley, 1941.




By substituting in the common element formula for correlation,
we find $r_{12} = .400$, $r_{13} = .000$, $r_{23} = .707$. These lead to
$z'_1 = .800z_2 - .566z_3$, and $r_{1.23} = .566$. Variable $X_3$ has a negative
regression weight, i.e., by the use of $X_3$ something is being
subtracted or suppressed. As set up here for illustrative purposes,
all the elements of $X_3$ are contained in $X_2$; these elements are not
related to $X_1$ and hence their presence in $X_2$ must tend to lower
the correlation between $X_1$ and $X_2$; if these elements could be
suppressed, the correlation between $X_1$ and $X_2$ minus the irrelevant
(so far as $X_1$ is concerned) elements of $X_2$ should be higher than
$r_{12}$. Actually, if we think of the d elements of the diagram as
being nonexistent, we would have variation in $X_2$ dependent upon
only 5 elements, 4 of which overlap with $X_1$. The correlation
between $X_1$ and the abridged $X_2$ would be $4/\sqrt{10(5)}$ or .566,
which has exactly the same value as the multiple r obtained above.
This exact correspondence to $r_{1.23}$ will be obtained only when all
the $X_3$ elements are contained in $X_2$. If $X_3$ contains other
elements, its use as a suppressant will aid in predicting $X_1$, but the
resulting $r_{1.23}$ will not correspond to an r deducible from the
common element formula. The reason for this is left as an exercise.
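The arithmetic of the illustration can be verified directly. In the Python sketch below the common element formula is taken to be the number of shared elements divided by the square root of the product of the two element counts, and the $\beta$'s come from the usual three-variable solution; the function and variable names are hypothetical.

```python
from math import sqrt


def common_element_r(n_a, n_b, n_common):
    """Correlation implied by shared elements: common / sqrt(n_a * n_b)."""
    return n_common / sqrt(n_a * n_b)


# X1 has 10 elements, X2 has 10, X3 has 5.
r12 = common_element_r(10, 10, 4)   # 4 elements shared by X1 and X2
r13 = common_element_r(10, 5, 0)    # nothing shared by X1 and X3
r23 = common_element_r(10, 5, 5)    # all 5 elements of X3 lie in X2

# Standard-score regression weights for the three-variable case.
beta2 = (r12 - r13 * r23) / (1 - r23 ** 2)
beta3 = (r13 - r12 * r23) / (1 - r23 ** 2)
r_multiple = sqrt(beta2 * r12 + beta3 * r13)

# X2 with its 5 irrelevant elements removed: 5 elements left, 4 shared with X1.
r_abridged = common_element_r(10, 5, 4)

print(round(r12, 3), round(r13, 3), round(r23, 3))
print(round(beta2, 3), round(beta3, 3), round(r_multiple, 3), round(r_abridged, 3))
```

The agreement between the multiple and the correlation with the abridged $X_2$ holds here because every element of $X_3$ is contained in $X_2$.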

The student, by resort to the notion of common elements, may 
secure a better understanding of the proposition that a higher 
multiple is obtainable when the correlations with the criterion are 
high and the correlations between the predictors low or zero. 
The reader should be warned, however, that such a condition is 
hard to realize in practice, as is also the finding of variables which 
will qualify as suppressants. 


NOTE ON NOTATION 

The symbol $r_{1.23}$ has been used to represent the correlation
(multiple) between $X_1$ and the best combination of $X_2$ and $X_3$.
This should not be confused with $r_{12.3}$, which indicates the
correlation (partial) between $X_1$ and $X_2$ with the effect of $X_3$ ruled out
or held constant. The symbol $\sigma_{y.x}$, it will be recalled, stood for
the standard error of estimate of Y as estimated from X; $\sigma_{1.2}$
would be the error of $X_1$ when estimated from $X_2$; and $\sigma_{1.23}$ would
be the standard error of estimate of $X_1$ when estimated from $X_2$
and $X_3$ by means of the multiple regression equation.





In the foregoing discussion, $\beta_2$ has been used as the symbol for
the regression weight of $X_2$. A more formal, albeit cumbersome,
notation would be $\beta_{12.345}$, which would be read as the regression
of $X_1$ on $X_2$, i.e., the coefficient for $X_2$, when taken in combination
with $X_3$, $X_4$, and $X_5$. It is not an accident that the subscript
pattern resembles that for the partial correlation coefficient. If
we were dealing with a three-variable problem, $\beta_2$ could be written
as $\beta_{12.3}$. This notation really means that we have the net
regression of $X_1$ on $X_2$ when $X_3$ is held constant. Hence the coefficients
are sometimes spoken of as partial regression coefficients. As a
matter of fact, these partial or multiple regression coefficients can
be computed by way of partial correlation coefficients, but the
method is not nearly so straightforward and self-checking as the
Doolittle procedure.



CHAPTER 10 


Other Correlation Methods 


The product moment correlation measure is applicable only
when the two variables are graduated, is restricted by the
assumption of linearity of regression, and needs careful qualifying if
either or both variables yield skewed distributions. There are,
therefore, many problems for which it is inappropriate. In general,
the majority of the situations which are met in practice can be
handled by some type of correlational technique. The use of rho,
or the rank-difference method, has already been mentioned.

There are no general rules to follow in the case of variables
yielding skewed distributions. Frequently, one can use a
logarithmic transformation of such a variable and thereby secure scores
which are at least approximately normal; or one may deliberately
normalize the distribution by converting the raw scores into T
scores. When one considers the arbitrary units involved in most
psychological measurement, such a procedure would seem not
only permissible but also defensible in that the correlational
description of the relationship need not be qualified because of
skewness.

The situations arising most frequently in practice, for which
measures of correlation are apt to be needed, can be subsumed
under the following five headings: (1) graduated measures for one
variable, dichotomized or two-category information for the second
variable; (2) both variables dichotomized; (3) three or more
categories for one variable and two or more for the second; (4) three
or more categories for one variable and a graduated series of
measures for the other; (5) both variables graduated, with curvilinear
relationship.

An estimate of the degree of correlation for each of the above
situations can be obtained providing certain assumptions
concerning the variables can be regarded as tenable. Ordinarily the
graduated variable can be thought of either as being continuous
or as progressing in a sufficient number of discrete steps so as to
give the appearance of continuity. The approach to normality
for such series can, obviously, be specified. The nature of the
categorized variable, whether discrete or continuous, can
ordinarily be ascertained on logical grounds, but the question of
whether a continuous variable for which we have only a
distribution by categories would yield a normal distribution if we had some
measuring stick for the trait is not easy to answer. Unfortunately,
some of the correlational measures to be discussed do assume
normality for the dichotomized trait, and equally unfortunate is
the fact that we are usually ignorant concerning the tenability of
this assumption.

BISERIAL CORRELATION 

If one variable is graduated and yields an approximately normal
distribution and the other is dichotomized, and if we can assume
that the underlying dichotomized trait is continuous and normal,
then we can obtain a correlation measure which constitutes an
estimate as to what the product moment r would be if both
variables were in graduated form. The most typical example of such a
situation is to be found in the mental test field: the correlation
between an item scored as pass or fail (or yes or no, like or dislike,
etc.) and a graduated criterion variable. We need to know each
individual's score on the graduated variable and whether he passed
or failed the item. We can then make a distribution or
scattergram with from 12 to 20 intervals for the graduated variable along
the y axis, and with 2 intervals for pass and fail along the x
axis.

In order to secure a better understanding of the meaning and
limitations of biserial r, let us begin with an ordinary scatter
diagram, say between height and weight. Suppose that height is
distributed along the y axis and weight along the x axis, and that
we decide on some arbitrary point on the weight axis for
dichotomizing individuals into “lights” and “heavies.” All individuals
are to be classed into one of these two categories. For the purpose
of our argument, it makes little difference whether the dividing
line cuts off 30, 40, 50, or 60 per cent as “lightweights.” Suppose
that 40 per cent fall into this category; then the correlational
scatter diagram, with the vertical line of demarcation, might look
something like that depicted in Fig. 14.

[Fig. 14. How a scattergram might be reduced to a biserial situation.]

Let us next recast the frequencies into the form we would have
if only categorized information for the weight variable were
available. The resulting “scatter” would be similar to that shown in
Fig. 15, in which we have introduced two other solid lines. One
of these is a horizontal line drawn at the mean Y value or mean
height for all N individuals in the scatter. This would, of course,
be the mean of the right-hand marginal distribution obtained by
summing frequencies across the table. Call this mean $M_y$, and
the sigma of this distribution, $\sigma_y$. The second line which we have
drawn connects the mean height of those in the “lightweight”
group with the mean height of those in the “heavy” group. Let
us designate these means as $M_1$ and $M_2$ respectively, i.e., as height
means for individuals in the first and second categories for weight.

[Fig. 15. Scatter for biserial r, with the X axis divided into “Lights” and “Heavies.”]

The slanting line, therefore, corresponds to the regression line
for predicting Y from X. Its slope will be $b_{yx} = r_{xy}\sigma_y/\sigma_x$. If we
could determine a value for $b_{yx}$ and knew $\sigma_y$ and $\sigma_x$, we could easily
solve this expression for $r_{xy}$, since algebraically

$$r_{xy} = b_{yx}\frac{\sigma_x}{\sigma_y}$$

It is easy to compute $\sigma_y$; it is the sigma of the total Y
distribution. To evaluate $b_{yx}$, we observe that the slope of the line can be
expressed in terms of certain distances in the figure. We can
represent the distance $M_2 - M_y$ as $y_2$ and the distance $M_y - M_1$
as $y_1$, but the meaning of the distances $x_1$ and $x_2$ needs explanation.
First we recall that for the ordinary scatter diagram, the regression
line always goes through the point of intersection of the vertical
and horizontal lines drawn through the means of X and Y.
Therefore $x_1$ and $x_2$ are distances measured from the mean of the total
X distribution. If they are to be analogous to $y_1$ and $y_2$, they
should represent distances from the total X mean to the respective
means for the X values of those in the first and second categories.

If the student recalls that the slope of a line can be expressed as
the ratio of the opposite to the adjacent side of the angle which it
makes with the horizontal, he will see at once from the large
triangle formed by $M_1$, $M_2$, and C that

$$b_{yx} = \frac{y_1 + y_2}{x_1 + x_2}$$

Obviously, the distance $y_1 + y_2$ equals $M_2 - M_1$. The values
for $x_1$ and $x_2$ are not so readily determined when, as will always be
true in the case of biserial r, we have only categorical information
on the X variable. It is at this point that the making of an
assumption will permit us to surmount the difficulty. As usual, there
can be no objection to making an assumption provided we
remember that we have done so and that we thereby have placed a
limitation on the use of the derived formula.

We assume that underlying the dichotomized information there
is a continuous normally distributed trait. It can be demonstrated
that the correlation between distributed values for Y and X would
not change if we were to transform all the X scores to standard
scores. If this were done, we would need the two distances, $x_1$
and $x_2$, expressed in terms of standard scores. The needed value
for $\sigma_x$ would automatically become unity. Now $x_1$ is really the
mean deviation for the individuals who fall in that portion, or tail,
of the distribution cut off by our demarcation point between
“lightweights” and “heavies.” If a proportion q of the cases falls
in the first category, it can be shown by methods of the integral
calculus that for a normal distribution $x_1$ will be equal to $z/q$,
where z is the value of the ordinate for the unit normal curve at
the point where q proportion of the cases are cut off. The value of
$x_2$ will be $z/p$, where p is the proportion in the second category
($p + q = 1$). As an exercise, the ingenious student might try
demonstrating (approximately) that the mean of a tail of a
distribution does equal $z/q$.
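One approximate demonstration is numerical. The Python sketch below, which is only an illustration, cuts off the lower q of the unit normal curve, finds the mean distance of that tail from the general mean by summing over narrow strips, and compares the result with $z/q$. The function name, the number of strips, and the lower limit of -10 standing in for minus infinity are all arbitrary choices.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, sigma 1


def tail_mean_distance(q, steps=20_000, lower=-10.0):
    """Mean distance below the general mean for the lowest q of a unit normal curve."""
    cut = unit_normal.inv_cdf(q)          # demarcation point
    width = (cut - lower) / steps
    area = 0.0
    moment = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width     # midpoint of the strip
        f = unit_normal.pdf(x)
        area += f * width
        moment += x * f * width
    return -moment / area                 # taken as a positive distance


q = 0.40
z = unit_normal.pdf(unit_normal.inv_cdf(q))   # ordinate at the cut
by_strips = tail_mean_distance(q)
by_formula = z / q

print(round(by_strips, 4), round(by_formula, 4))
```

The same comparison can be repeated for other values of q; the upper tail follows by symmetry with p in place of q.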

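We are now ready to piece together the several parts of

$$r_{xy} = b_{yx}\frac{\sigma_x}{\sigma_y}$$

Thus

$$r_{xy} = \frac{y_1 + y_2}{x_1 + x_2}\cdot\frac{\sigma_x}{\sigma_y} = \frac{M_2 - M_1}{\dfrac{z}{q} + \dfrac{z}{p}}\cdot\frac{1}{\sigma_y}$$

hence we have

$$r_{xy} = \frac{(M_2 - M_1)pq}{z\sigma_y} = r_b \qquad (76)$$

as the formula for biserial correlation.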

Its computation is not at all difficult; a table of ordinates for the
unit normal curve is needed (the reader will recall that the
maximum ordinate, for p = q = .50, in such a table is .3989).

It is readily noticed that, if the cases falling in the second
category represent those possessing more of the dichotomized trait,
then $r_b$ will be positive when $M_2 - M_1$ is positive, and negative
when $M_2 - M_1$ is negative. If $M_2 = M_1$, $r_b = 0$. Another
obvious point: since only the product of p and q is involved, it
really doesn't make any difference which proportion we call
p (or q).

If the reader has followed the foregoing development, he will
have inferred that $r_b$ really represents an estimate of what the
product moment correlation would be if a measuring stick were
available for the dichotomized variable. The assumptions
underlying the formula for biserial r should also have been grasped.
(Linearity of regression was assumed at what point in the above
discussion?)
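Formula (76) lends itself to a short routine. The following Python is a sketch under stated assumptions: the scores and pass-fail records are invented for illustration, $\sigma_y$ is computed with N (not N - 1) in the denominator, and the ordinate z is taken from the standard library's normal curve rather than from a printed table. The function name `biserial_r` is hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(scores, in_second):
    """Biserial r by formula (76): (M2 - M1) * p * q / (z * sigma_y).

    scores    -- graduated Y values, one per individual
    in_second -- True where the individual falls in the second X category
    """
    first = [y for y, s in zip(scores, in_second) if not s]
    second = [y for y, s in zip(scores, in_second) if s]
    p = len(second) / len(scores)          # proportion in the second category
    q = 1 - p
    unit_normal = NormalDist()
    z = unit_normal.pdf(unit_normal.inv_cdf(p))   # ordinate at the point of cut
    return (mean(second) - mean(first)) * p * q / (z * pstdev(scores))


# Invented data: criterion scores for 6 who failed an item and 8 who passed it.
failed = [3, 4, 5, 5, 6, 7]
passed = [4, 5, 6, 6, 7, 7, 8, 9]
scores = failed + passed
in_second = [False] * len(failed) + [True] * len(passed)

r_b = biserial_r(scores, in_second)
r_b_swapped = biserial_r(scores, [not s for s in in_second])

print(round(r_b, 3), round(r_b_swapped, 3))
```

Interchanging the two categories reverses the sign of $M_2 - M_1$ but leaves $pq$ and z unchanged, so only the sign of the result changes.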

If one is determining the biserial correlation between a criterion
and a series of items, a time-saving alternate form of equation (76)
is suitable:

$$r_b = \frac{(M_2 - M_y)p_2}{z\sigma_y} \qquad (76a)$$

in which $p_2$ is the proportion of cases falling in the second
category (or $M_1$ and $p_1$ could be used; i.e., the subscripts should agree).
The obvious advantage of this form is that, once the value for $M_y$
is calculated, then the r for each additional item entails the
computation of only one mean. The z is, of course, the ordinate
corresponding to the value of $p_2$. Both $p_2$ and z must be determined
for each item. The derivation of (76a) from (76) is left as an
exercise.
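One way of carrying out the derivation is sketched here, writing $p_1$ and $p_2$ for the proportions in the first and second categories (so that $p_1 = q$ and $p_2 = p$ in formula (76)) and using the fact that the total mean is the weighted mean of the two category means:

```latex
\begin{align*}
M_y &= p_1 M_1 + p_2 M_2, \qquad p_1 + p_2 = 1,\\
M_2 - M_y &= (1 - p_2) M_2 - p_1 M_1 = p_1 (M_2 - M_1),\\
(M_2 - M_1)\,p_1 p_2 &= (M_2 - M_y)\,p_2,\\
r_b &= \frac{(M_2 - M_1)\,p_1 p_2}{z\,\sigma_y} = \frac{(M_2 - M_y)\,p_2}{z\,\sigma_y}.
\end{align*}
```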

The sampling error of biserial r is given approximately by

$$\sigma_{r_b} = \frac{\dfrac{\sqrt{pq}}{z} - r_b^2}{\sqrt{N}}$$

As an exercise, the student should compare the magnitude of the
sampling error of biserial r for varying values of p (and q) with
that of the product moment r as given by the classical form,

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}}$$

It might be anticipated that the sampling error will be large when
the dichotomies are extreme, i.e., involve cuts yielding extreme
values for p and q. Thus, if N = 100, and q = .05, it follows that
one of the means used in computing $r_b$ by formula (76) will be
based upon only five cases and consequently will be subject to
rather large sampling fluctuation, which incidentally will not be
counterbalanced entirely by the relatively greater stability of the
other mean. It may occur to the reader that the use of formula
(76a) would overcome this difficulty, since one could always arrange
to use the mean for the category having the larger number of cases,
thereby avoiding the unstable mean. This appears plausible
enough; its refutation is left to the student.

The fact that the sampling error for biserial r is large when extreme
dichotomies are involved should serve as a warning. Unless
N is fairly large, one should not place much confidence in a biserial
r based on cuts more extreme than .10 or .90.

Since no r to z transformation is available for use with biserial r,
the difficulty of skewed sampling distributions for high $r_b$'s cannot
be overcome. In testing the null hypothesis (that no correlation
exists) $\sigma_{r_b}$ becomes

$$\sigma_{r_b} = \frac{\sqrt{pq}}{z\sqrt{N}}$$
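The effect of the cut on this null-hypothesis standard error can be tabulated in a few lines. The sketch assumes the form $\sqrt{pq}/(z\sqrt{N})$ and compares it with $1/\sqrt{N}$, the classical value for a product moment r of zero; the function name `null_se_biserial` and the particular cuts are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def null_se_biserial(p, n):
    """Standard error of biserial r when the true correlation is zero."""
    std = NormalDist()
    z = std.pdf(std.inv_cdf(p))    # ordinate of the unit normal at the cut
    return sqrt(p * (1.0 - p)) / (z * sqrt(n))

N = 100
# Ratio of the biserial standard error to 1/sqrt(N) for several cuts.
ratios = {p: null_se_biserial(p, N) * sqrt(N) for p in (0.50, 0.70, 0.90, 0.95)}
for p, ratio in ratios.items():
    print(f"p = {p:.2f}   ratio = {ratio:.3f}")
```

The ratio is smallest for the .50 cut and grows as the cut becomes more extreme, which is the point of the warning about extreme dichotomies.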





Although a biserial r is an estimate of a product moment r, there 
is question as to its interpretation. It is, of course, a measure of 
the degree of relationship existing between a graduated and a di¬ 
chotomized variable. It does not, however, enter into prediction 
formulas, nor does it lead to an error of estimate. If we know to 
which X category an individual belongs, the predicted Y is simply 
the mean of the Y scores for that category, and the error of such 
an estimate is the standard deviation of the Y scores in the given 
category. This error of estimate would not equal $\sigma_y\sqrt{1 - r_b^2}$.

If one has a Y score to use in predicting an individual's X category,
he estimates on the basis of the frequency with which those 
possessing Y scores in a given interval tend to fall predominantly 
into the first or second category for X. The error for such a pre¬ 
diction must depend upon the relative frequencies in these two 
categories for individuals possessing the given Y score. Thus, if 
the frequencies for the first and second categories were 18 and 6, 
the error might be stated something like this: the odds are 3 to 1 
that the given individual's X position is in the first category, i.e., 
75 per cent of the time the prediction would be correct. Such a 
statement of error would need to be qualified according to the 
values of p and q. How? 

The tenability of the assumption of a continuous normally dis¬ 
tributed variate underlying the dichotomized trait must always 
be faced by the user of biserial r. This assumption is not a pre¬ 
requisite for the point biserial correlation coefficient. Such a 
coefficient assumes that for the trait underlying the dichotomy the 
values can be thought of as falling at two points instead of being 
distributed in a continuous fashion. Since there are few, if any, 
traits in psychology which qualify as yielding such discontinuous 
point distributions, it follows that the use of point biserial r may 
seldom be defensible. Those who advise the point biserial for 
work with mental test items certainly ignore what seems to be 
obvious, namely, that failing a test item represents anything from 
a dismal failure up to a near pass, while passing the item involves 
barely passing up to passing with the greatest of ease. One can 
also argue that there are likely to be more individuals who nearly 
pass than who fail dismally. Such a line of reasoning does not, of 
course, justify the assumption of normality, but it certainly is 
presumptive evidence for continuity. As a matter of fact, one 
can usually justify the use of the regular biserial r with obviously 
continuous variables by saying that the coefficients obtained represent
what we would expect the product moment correlation to be 
if we had a measuring scale, for the dichotomized trait, which 
actually yielded a normal distribution. The bothersome question 
concerning the normality of the distribution of the trait (if and
when measured in some hypothetical, unattainable, equal units)
is thereby side-stepped. If the point nature of the dichotomized
variable can really be demonstrated, then and only then can one
justify using the point biserial correlation coefficient, which can
be obtained by

$$r_{pb} = \frac{M_2 - M_1}{\sigma_y}\sqrt{pq}$$

In general, the point biserial r tends to be lower than and hence
is not comparable with either $r_b$ or the product moment r.
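A short numerical check may make the point biserial coefficient more concrete. The sketch assumes the usual form $r_{pb} = \frac{M_2 - M_1}{\sigma_y}\sqrt{pq}$, with $\sigma_y$ computed on N, and shows that it coincides with the ordinary product moment r obtained when the two categories are scored 0 and 1. The data are made up and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev

def point_biserial(y, in_second):
    """Point biserial r from the two category means (assumed usual form)."""
    second = [v for v, flag in zip(y, in_second) if flag]
    first = [v for v, flag in zip(y, in_second) if not flag]
    p = len(second) / len(y)
    q = 1.0 - p
    return (fmean(second) - fmean(first)) / pstdev(y) * sqrt(p * q)

def product_moment_r(x, y):
    """Ordinary product moment r from deviations about the means."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

scores = [3, 5, 6, 7, 8, 9, 10, 12, 13, 15]
passed = [False, False, False, True, False, True, True, True, True, True]
r_pb = point_biserial(scores, passed)
r_01 = product_moment_r([1 if flag else 0 for flag in passed], scores)
print(round(r_pb, 4), round(r_01, 4))
```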

TETRACHORIC CORRELATION 

When both variables yield only dichotomized information, as, 
for example, two items scored as passed or failed, it is possible to 
secure an estimate of what the correlation would be if the under¬ 
lying traits were continuous and normally distributed or if they 
were so measured as to give normal distributions. The measure 
of relationship for such a situation is known as the tetrachoric 
correlation coefficient, usually designated as $r_t$. It is not feasible
to derive here the formula for tetrachoric correlation, but perhaps
a few words will help one understand the reasoning back of the 
formula. 

Let us suppose that we have before us a scattergram for the 
correlation between height and weight; let us further assume that 
this scatter exhibits all the characteristics of a normal correla¬ 
tional surface as defined by equation (41). That is, the two 
marginal distributions and all the vertical and horizontal array
distributions are normal, the regressions are linear, and the arrays
homoscedastic. For such a normal plot, it is possible, knowing
the degree of correlation and the means and sigmas of the two
variables, to specify how many or what proportions of the cases
will fall in any given segment of the scatter plot. This can be
done by mathematical manipulation of formula (41) or by the
aid of Table VIII of Pearson's Tables for statisticians and
biometricians, part II.*

Now, of course, if one had placed before him a scatter for height 
vs. weight and were asked how many cases fell in that portion of 
the table below 120 pounds and also below 68 inches, he would 
simply count them. But suppose he were told that, when the 
two axes were cut at 120 pounds and 68 inches, the frequencies in 
each of the four quadrants so formed were as shown in Table 21. 
The purpose of tetrachoric correlation is to ascertain the degree of 
correlation which would permit the observed frequencies in such a 
fourfold table. A more rigorous statement would be: Given the
four frequencies, what should be the true correlation (for the
scatter underlying the fourfold table) in order to make the obtained
four frequencies most likely?


Table 21. Correlation for Height and Weight Dichotomized

                 Below 120 lb.   Above 120 lb.   Total
Above 68 in.          10              80           90
Below 68 in.          60              50          110
Total                 70             130          200


In order to secure this estimate it is necessary to convert into a
proportion each of the four frequencies and each of the marginal
totals by dividing by N. For the fourfold table we may symbolize
the frequencies as in Table 22, the proportions as in Table 23.


Table 22. Frequencies                Table 23. Proportions

          -       +                            -      +
   +      A       B      A + B          +      a      b      p
   -      C       D      C + D          -      c      d      q
        A + C   B + D      N                   q'     p'    1.0


* Pearson, Karl, Tables for statisticians and biometricians, part II,
Cambridge: Cambridge University Press, 1931.






Then, the tetrachoric coefficient can be obtained from the following
rather forbidding equation:

$$\frac{c - qq'}{z_x z_y} = r + xy\,\frac{r^2}{2} + (x^2 - 1)(y^2 - 1)\,\frac{r^3}{6} + (x^3 - 3x)(y^3 - 3y)\,\frac{r^4}{24} + \cdots \qquad (79)$$

in which it is assumed that both q and q' are less than .50. The
general rule is to choose whichever is smaller, p or q, to pair with
whichever is smaller, p' or q'. This determines, logically, whether
a or b or c or d becomes a part of the formula. Thus one can have
c - qq' (as given), or b - pp', each of which will yield a positive r
for positive correlation or a negative r for negative correlation,
or one can have a - q'p or d - qp', each of which will yield an r
with sign opposite to its true sign. (It is, of course, here assumed
that reading to the right on the x axis and up on the y axis means
more of the traits.)

We must next specify the meaning of the x, y, and z's in formula
(79). As for biserial r, $z_y$ is the ordinate of the unit normal curve
where q proportion of the cases are cut off; $z_x$ has a similar meaning
for q'. The y represents the value on the base line of the unit
normal curve where q cases are cut off, i.e., the $x/\sigma$ in Table A of
the Appendix, and x is similarly determined from a knowledge of
q'.

To equation (79) additional terms may be added which will 
result in a closer approximation at the expense of a greater, if not 
an impossible, amount of computation. For the given formula, 
the solution for r involves determining the roots of a fourth-degree 
or quartic equation. Either Horner's or Newton's method, as 
described in college algebra texts, will do the trick. The fourth- 
degree equation will yield satisfactory approximations except 
when r is high. 
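The following is a minimal sketch of the quartic solution by Newton's method, applied to the frequencies of Table 21 (A = 10, B = 80, C = 60, D = 50, the lower right cell being fixed by the marginal totals). Several things here are assumptions rather than the text's own procedure: the series is taken as $r + xy\,r^2/2 + (x^2-1)(y^2-1)\,r^3/6 + (x^3-3x)(y^3-3y)\,r^4/24$ set equal to $(c - qq')/(z_x z_y)$; the lower left cell c is always used; and x and y are taken as signed deviates, which agree with the tabled positive values when q' and q are under .50. The function name is hypothetical.

```python
from statistics import NormalDist

def tetrachoric_quartic(A, B, C, D, tol=1e-10, max_iter=50):
    """Solve the series, cut off after the r**4 term, by Newton's method."""
    n = A + B + C + D
    c = C / n                      # lower left cell as a proportion
    q = (C + D) / n                # proportion in the lower row
    q_prime = (A + C) / n          # proportion in the left column
    std = NormalDist()
    y = std.inv_cdf(1.0 - q)       # signed deviate with q of the area below the cut
    x = std.inv_cdf(1.0 - q_prime)
    target = (c - q * q_prime) / (std.pdf(x) * std.pdf(y))
    k2 = x * y / 2.0
    k3 = (x * x - 1.0) * (y * y - 1.0) / 6.0
    k4 = (x ** 3 - 3.0 * x) * (y ** 3 - 3.0 * y) / 24.0
    r = max(-1.0, min(1.0, target))          # starting value
    for _ in range(max_iter):
        f = r + k2 * r ** 2 + k3 * r ** 3 + k4 * r ** 4 - target
        slope = 1.0 + 2.0 * k2 * r + 3.0 * k3 * r ** 2 + 4.0 * k4 * r ** 3
        step = f / slope
        r -= step
        if abs(step) < tol:
            break
    return r

r_t = tetrachoric_quartic(10, 80, 60, 50)
print(round(r_t, 2))
```

Because the series is cut off after the fourth-degree term, the result is only an approximation, and a poorer one when r is high.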

The solution of a quartic equation is not difficult, nor is it so
easy as to lead to mass production of tetrachoric r's. Fortunately,
it is no longer necessary to go through this tedious method for
getting an approximation to the value of $r_t$. Diagrams† are
available which enable one to determine quickly the value of $r_t$
for any given table of proportionate frequencies. Anyone having
as many as a half-dozen tetrachorics to compute will find it
economical to possess a copy of these diagrams.

† Chesire, L., Saffir, M., and Thurstone, L. L., Computing diagrams for the
tetrachoric correlation coefficient, Chicago: University of Chicago Bookstore,
1933.

The tetrachoric r is particularly useful in estimating the degree 
of correlation between variables for which we have only dichoto¬ 
mized information, but it can also be used instead of biserial r 
or the product moment r, since situations for which these two 
methods apply can readily be converted into fourfold tables by 
simply dichotomizing the graduated variables. The advantage of 
so estimating correlation is that tetrachoric r is much easier to 
determine (by using the computing diagrams) than is calculating 
either biserial r or the product moment r. Indeed, this fact of 
computational economy has led a number of investigators to use 
$r_t$ when product moment r's could be determined. That such a
practice may be short-sighted economy becomes quite evident
when we turn to the sampling fluctuation of $r_t$.

The standard error of $r_t$ is closely approximated by 

(80) 

When this is compared to the classical formula for the standard
error of a product moment r, i.e., to $\sigma_r = (1 - r^2)/\sqrt{N}$, it will
be seen that the tetrachoric r has a much larger sampling error.
To illustrate the difference, the sigmas for four $r_t$'s for two different
dichotomies are presented in Table 24 along with the sigmas (by
the classical formula) of the corresponding product moment r's
for N = 100.

Table 24. Sampling Errors of $r_t$ and r Compared

r or $r_t$     p      p'     $\sigma_{r_t}$    $\sigma_r$
  .00       .50    .50     .157     .100
  .00       .80    .80     .204     .100
  .40       .50    .50     .130     .084
  .40       .80    .80     .182     .084
  .60       .50    .50     .115     .064
  .60       .80    .80     .150     .064
  .80       .50    .50     .073     .036
  .80       .80    .80     .095     .036






It can readily be seen from this table that $r_t$ is much less stable
than r; in fact, even for the most favorable comparison (.50-.50
cuts, low r's), the standard error of the tetrachoric coefficient is
more than 50 per cent greater than that for the product moment
coefficient. This means that one must have more than twice as
many cases to attain the same degree of sampling stability for a
tetrachoric as for a product moment correlation coefficient. For
.80-.20 cuts and low correlations, four times as many cases are
needed to have comparable sampling errors. For high correlations
and also for more extreme cuts, $r_t$ compares still less favorably
with r.
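The statements about the number of cases follow from the fact that a standard error varies inversely as the square root of N, so that the multiplier of cases needed is the square of the ratio of the two sigmas. A small sketch using the tabled values (copied from Table 24, N = 100; the variable names are only illustrative):

```python
# (r, cut, sigma of r_t, sigma of r) as printed in Table 24 for N = 100.
table_24 = [
    (0.00, 0.50, 0.157, 0.100),
    (0.00, 0.80, 0.204, 0.100),
    (0.40, 0.50, 0.130, 0.084),
    (0.40, 0.80, 0.182, 0.084),
    (0.60, 0.50, 0.115, 0.064),
    (0.60, 0.80, 0.150, 0.064),
    (0.80, 0.50, 0.073, 0.036),
    (0.80, 0.80, 0.095, 0.036),
]

def case_multiplier(se_tetrachoric, se_product_moment):
    """How many times as many cases r_t needs to match the sigma of r."""
    return (se_tetrachoric / se_product_moment) ** 2

multipliers = [round(case_multiplier(se_t, se_r), 2)
               for _, _, se_t, se_r in table_24]
for (r, cut, _, _), m in zip(table_24, multipliers):
    print(f"r = {r:.2f}  cut = {cut:.2f}  cases needed: {m} times")
```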

The foregoing discussion and further study of formula (80) lead 
to two obvious conclusions. 

First, the increasing sampling instability of $r_t$ as the dichotomies
become more extreme warns us that, unless N is large, one cannot
place much reliance on $r_t$ for cuts more extreme than .10-.90;
seldom will N be large enough to warrant confidence in a tetrachoric
based on cuts more extreme than .05-.95.

Second, in using $r_t$ instead of the product moment r when the
latter is calculable, one is always throwing away the equivalent
of more than half the available data. Thus the computational
economy may be an expensive luxury: it is very doubtful whether
the calculation of a product moment r for N cases will ever require
anything but a fraction of the expense of securing data on the
additional N cases needed to counterbalance the greater sampling error
incurred in using the tetrachoric coefficient.

As in the case of $r_b$, no r to z transformation exists for handling
the sampling errors of high tetrachorics. For testing the null
hypothesis, that $r_t$ for the universe is zero, we may use a simpler
expression for its standard error, namely, $\sigma_{r_t} = \sqrt{pq\,p'q'}/(z_x z_y\sqrt{N})$.
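As a check on this simpler expression, the sketch below assumes the form $\sqrt{pq\,p'q'}/(z\,z'\sqrt{N})$, with z and z' the unit normal ordinates at the two cuts, and recomputes the two zero-correlation entries of Table 24 for N = 100. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def null_se_tetrachoric(p, p_prime, n):
    """Standard error of tetrachoric r when the universe value is zero."""
    std = NormalDist()
    z = std.pdf(std.inv_cdf(p))              # ordinate at the first cut
    z_prime = std.pdf(std.inv_cdf(p_prime))  # ordinate at the second cut
    q, q_prime = 1.0 - p, 1.0 - p_prime
    return sqrt(p * q * p_prime * q_prime) / (z * z_prime * sqrt(n))

se_50 = round(null_se_tetrachoric(0.50, 0.50, 100), 3)
se_80 = round(null_se_tetrachoric(0.80, 0.80, 100), 3)
print(se_50, se_80)
```

Both values agree with the tabled sigmas for r = .00, against .100 for the product moment r.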

Another method for judging the significance of the correlation 
computed from a fourfold table will be presented in the next 
chapter. 

The use of tetrachoric r is circumscribed by an assumption: that
the underlying correlational surface is of the normal type. Among
other things this implies (a) that the dichotomized traits are
continuous and normally distributed, and (b) that the regressions are
linear. Although, as discussed in connection with biserial r, we
are usually ignorant of the tenability of (a), this ignorance can be
partially overcome by regarding the correlation as that which
would obtain if the traits were normalized; i.e., it can be argued
that the use of tetrachoric r automatically normalizes the distributions.
It is not so easy to dispose of assumption (b), since the
normalizing of variables will not necessarily lead to linearity of
regression. The only consolation here is that measured psychological
traits are usually linearly related, if related at all.


FOURFOLD POINT CORRELATION 


If one can safely assume point distributions for both dichotomized
variables, a measure of correlation can be obtained from the
fourfold table (Table 22) by

$$r_p = \frac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}} \qquad (81)$$

which is known as the fourfold point correlation coefficient. Those
who propose to use this must take into consideration the questions
regarding point distributions discussed on p. 173. In general,
formula (81) tends to yield values which are lower than the corresponding
tetrachorics. Although we do not approve of the point
r as a measure of association, it is the correct value to use in the
$\sigma_D$ for correlated proportions, but this standard error can more
readily be computed by formula (28) or (28a), both of which take
into account $r_p$ as the measure of correlation.
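Formula (81) is simple enough to state as a one-line function. The sketch applies it to the height and weight frequencies of Table 21, taking A = 10, B = 80, C = 60, and D = 50 (the last being the value required by the marginal totals); the function name is only illustrative.

```python
from math import sqrt

def fourfold_point_r(A, B, C, D):
    """Fourfold point correlation from the cell frequencies of Table 22."""
    return (B * C - A * D) / sqrt((A + B) * (C + D) * (A + C) * (B + D))

r_p = fourfold_point_r(10, 80, 60, 50)
print(round(r_p, 3))
```

The value comes out near .45, which is a good deal lower than a tetrachoric estimate for the same table would be.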


CONTINGENCY COEFFICIENT 


The contingency coefficient is a measure of the degree of association
or correlation which exists between variables for which we
have only categorical information. The number of categories can
be such as to provide a 2 by 2 table (as for tetrachoric correlation)
or a 2 by 3, or a 3 by 3, or a 3 by 4, or a 4 by 4, or a k by l table.
This coefficient is stated in terms of a quantity known as $\chi^2$
(chi square) thus

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} \qquad (82)$$

where

$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (83)$$

in which O is the observed frequency (not percentage) and E is
the expected frequency for a given cell. In a 2 by 3 table there
would be 6 cells, hence 6 values summed to get $\chi^2$. The expected
cell frequencies for the contingency situation are those frequencies
which would exist if there were no association or relationship
between the given variables. It can thus be anticipated that,
the larger the discrepancy between expected and observed frequencies
relative to the expected, the larger the value of $\chi^2$ and
consequently the higher the value of C.

An example will help to clarify the above. Suppose that we
have 2 variables, each of which yields 3 categories or classifications,
and that the observed frequencies are as given in Table 25, which
also contains the expected frequencies in parentheses. (Fictitious
data; marginal frequencies arranged so as to simplify exposition.)
In order to ascertain the expected frequencies, needed in the
computation of $\chi^2$, we ask what cell frequencies would be expected if
there were no relationship, or zero association, between the 2
variables. Consider the 100 classified as college; if no association
existed, one would expect that these 100 would be distributed
according to a 1, 3, 1 ratio, i.e., in the same ratio as the marginal
frequencies at the bottom. Thus the expected cell frequencies for
the top row of cells would be 20, 60, 20. The expected frequencies
for the middle and bottom rows of cells should also be in a 1, 3, 1
ratio. Both these rows would have expected frequencies of 40,
120, 40.

Table 25. Contingency Table
(expected frequencies in parentheses)

                  Low        Medium       High      Total
College           5 (20)     45 (60)     50 (20)     100
High school      50 (40)    110 (120)    40 (40)     200
Grade school     45 (40)    145 (120)    10 (40)     200
Total           100         300         100          500

It will be noted that (1) the expected frequencies for the columns
follow, as they should, the ratio of 1, 2, 2, i.e., the ratio of 100, 200,
200 for the marginal frequencies on the right; (2) the expected
frequencies sum to the same marginal totals as the observed
frequencies; and (3) the expected frequencies actually exhibit a zero
relationship between the two characteristics.

In practice, the computation of the expected frequencies can 
readily be accomplished by either of two schemes: (a) express the 
marginal totals along the bottom as proportions of the total N, 
then multiply each of the frequencies on the right margin by each 
proportion in turn, entering the resulting product in the cell com¬ 
mon to the two marginal figures involved in the multiplication; 
or (b) multiply any frequency on the bottom margin by any frequency
on the right margin, and divide this product by N, and 
the result is the expected frequency for the cell common to the two 
marginals involved in the products. 
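Scheme (b) can be sketched in a few lines. The observed frequencies are those of the education example (college row 5, 45, 50; high school 50, 110, 40; grade school 45, 145, 10), and the variable names are only illustrative.

```python
# Observed frequencies: rows are college, high school, grade school;
# columns are low, medium, high.
observed = [
    [5, 45, 50],
    [50, 110, 40],
    [45, 145, 10],
]

row_totals = [sum(row) for row in observed]         # right margin
col_totals = [sum(col) for col in zip(*observed)]   # bottom margin
n = sum(row_totals)

# Scheme (b): product of the two marginal frequencies divided by N.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
for row in expected:
    print(row)
```

Each row of expected frequencies falls in the 1, 3, 1 ratio of the bottom margin, and each column in the 1, 2, 2 ratio of the right margin.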

The computation of $\chi^2$ is now a routine matter. We simply
take each cell in turn, square the difference between the observed
and expected value, and divide by the expected frequency. Thus
we have

(5 - 20)^2/20 = 11.25
(45 - 60)^2/60 = 3.75
(50 - 20)^2/20 = 45.00
(50 - 40)^2/40 = 2.50
(110 - 120)^2/120 = .83
(40 - 40)^2/40 = .00
(45 - 40)^2/40 = .62
(145 - 120)^2/120 = 5.21
(10 - 40)^2/40 = 22.50

The sum of these quantities, 91.66, is $\chi^2$. To get C, the coefficient
of contingency, the value of $\chi^2$ is substituted in formula (82), thus

$$C = \sqrt{\frac{91.66}{500 + 91.66}} = .39$$
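The whole computation, from observed frequencies to C, may be sketched as follows. The function names are hypothetical, and the expected frequencies are obtained from the marginal totals as described above. Carried to full precision the sum is about 91.67 rather than 91.66, the small difference arising from rounding each term to two decimals; C is unaffected to two places.

```python
from math import sqrt

def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells, E from the marginal totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for row, rt in zip(observed, row_totals):
        for o, ct in zip(row, col_totals):
            e = rt * ct / n
            total += (o - e) ** 2 / e
    return total

def contingency_coefficient(chi2, n):
    """Formula (82): C = sqrt(chi2 / (N + chi2))."""
    return sqrt(chi2 / (n + chi2))

observed = [[5, 45, 50], [50, 110, 40], [45, 145, 10]]
chi2 = chi_square(observed)
C = contingency_coefficient(chi2, 500)
print(round(chi2, 2), round(C, 2))
```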


This strength of association is not to be interpreted as indicating
the same degree of relationship as an ordinary (or biserial or tetrachoric)
coefficient of the same magnitude. One reason for this is
that the upper limit for the contingency coefficient is a function
of the number of categories. The upper limit for a 2 by 2 table is
$\sqrt{1/2}$; for a 3 by 3 table, $\sqrt{2/3}$; for a 4 by 4 table, $\sqrt{3/4}$; for a 5 by 5
table, $\sqrt{4/5}$; for a k by k table, $\sqrt{(k - 1)/k}$. The exact upper
limits for rectangular tables, such as 2 by 3, 2 by 4, 3 by 4, are
unknown. (As an exercise, the student might demonstrate to his
own satisfaction the upper limit for 2 by 2 and 3 by 3 tables.)
The reader will also note that C can never be negative.
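One way to see where these limits come from is to compute C for a table in which all the cases lie on the diagonal, i.e., a table showing perfect association. The sketch below does this for made-up k by k tables with equal diagonal frequencies; it shows that such a table attains $\sqrt{(k - 1)/k}$, though it is not offered as a proof that no table can exceed it. The function names are hypothetical.

```python
from math import isclose, sqrt

def contingency_c(table):
    """C = sqrt(chi2 / (N + chi2)) with E taken from the marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((o - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_totals)
               for o, ct in zip(row, col_totals))
    return sqrt(chi2 / (n + chi2))

def diagonal_table(k, per_cell=10):
    """A k by k table with every case on the diagonal."""
    return [[per_cell if i == j else 0 for j in range(k)] for i in range(k)]

limits = {k: contingency_c(diagonal_table(k)) for k in (2, 3, 4, 5)}
for k, c in limits.items():
    print(k, round(c, 4), round(sqrt((k - 1) / k), 4))
```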

Despite having varying maximal values, contingency coeffi¬ 
cients have a decided advantage over other measures of relation¬ 
ship; no assumptions involving the nature of the variables need 
be met—continuous or discrete variables, normal or skewed or 
any shaped distributions for underlying traits, ordered or unordered
series, and combinations thereof are permissible.

Disadvantages are that any two contingency coefficients are not 
comparable unless derived from tables of the same size, that they 
are noncomparable to product moment r’s (and estimates thereof) 
unless certain corrections are applied, and that the formula for 
sampling error is unwieldy. The necessary corrections and the 
sampling error formula may be found in Kelley,‡ but before consulting
Kelley, the reader might bear in mind the following comments.

In regard to the corrections, the first is for number of categories. 
The additional correction to make C an estimate of r involves the 
assumption that the underlying traits are continuous and normal 
in distribution. Furthermore, this correction is very tedious to 
make. It is suggested that, if the assumption of normally dis¬ 
tributed continuous variables is tenable, one is justified in reducing 
a contingency table of more than four cells to a 2 by 2 table and 
then determining the value of tetrachoric r. When reducing to a 
fourfold table, one should combine adjacent categories so as to 
have dichotomies as near to .50-.50 proportions as possible.
The combination should not be made on the basis of the pattern
of cell frequencies, since this is likely to involve a capitalization or
decapitalization on chance. One might take several or all possible
fourfold combinations, thus securing several tetrachoric r's which
may then be averaged.

As to the unwieldy sampling error formula for C, it is suggested
that in so far as one wishes simply to test the null hypothesis, i.e.,
that there is no relationship between the two given variables, one
need only enter the value of $\chi^2$ into an appropriate probability
table to test its significance. If $\chi^2$ is significant, then the relationship
is significantly greater than zero. This use of $\chi^2$ will be
discussed in the next chapter. It should be remarked that, if
any one (or more) expected cell frequency is small, say less than
5, the resulting C may be quite erroneous.

‡ Kelley, T. L., Statistical method, pp. 266-271, New York: Macmillan,
1924.

Chi square for a fourfold table can be readily obtained by formula
without first computing expected frequencies. Thus for a
set of frequencies like that of Table 22 we have

$$\chi^2 = \frac{N(BC - AD)^2}{(A + B)(C + D)(A + C)(B + D)}$$

This resembles formula (81). In fact, there is a relationship between
the fourfold point coefficient ($r_p$), $\chi^2$, and C:

$$r_p^2 = \frac{\chi^2}{N} \quad \text{and} \quad C = \sqrt{\frac{r_p^2}{1 + r_p^2}}$$
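These relationships are easily verified numerically. The sketch assumes the fourfold formula in the form $\chi^2 = N(BC - AD)^2 / [(A + B)(C + D)(A + C)(B + D)]$ and uses the height and weight frequencies (A = 10, B = 80, C = 60, D = 50); the names are only illustrative.

```python
from math import isclose, sqrt

def fourfold_chi_square(A, B, C, D):
    """Chi square for a 2 by 2 table without computing expected frequencies."""
    n = A + B + C + D
    return n * (B * C - A * D) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

A, B, C, D = 10, 80, 60, 50
n = A + B + C + D
chi2 = fourfold_chi_square(A, B, C, D)

r_p_squared = chi2 / n                          # square of the fourfold point r
c_from_chi2 = sqrt(chi2 / (n + chi2))           # formula (82)
c_from_r_p = sqrt(r_p_squared / (1.0 + r_p_squared))
print(round(chi2, 2), round(sqrt(r_p_squared), 3), round(c_from_chi2, 3))
```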

Other measures of association or of correlation between attrib¬ 
utes have been advocated. This is not the place to argue the pros 
and cons of these other measures. It seems to the author that the 
measures we have discussed are the more defensible. 

THE CORRELATION RATIO OR η (ETA) 

It will be recalled that one way of understanding the product
moment correlation coefficient is to note from the relationship,
$r^2 = 1 - \sigma^2_{y \cdot x}/\sigma^2_y$ (or $r^2 = 1 - \sigma^2_{x \cdot y}/\sigma^2_x$), that the degree of
correlation is a function of the error of estimate variance relative to the
total variance of the variable being predicted by a linear regression
line. If the array means fail to fall on a straight line, it can rightly
be argued that better prediction can be made by using a curve
which really "fits" the means or by using the means themselves.
The latter procedure would entail an error of estimate which
would be a function of the variance within the arrays about the
array means. An over-all variance about the means of the vertical
arrays can be calculated by squaring the deviations about the
mean of each array, summing these for all arrays, and then dividing
by N. The resulting variance for the vertical arrays may be
labeled $\sigma^2_{ay}$; for the horizontal arrays, $\sigma^2_{ax}$.





The correlation ratio, $\eta$, in terms of the accuracy with which
Y's can be predicted from X's is defined as

$$\eta^2_{yx} = 1 - \frac{\sigma^2_{ay}}{\sigma^2_y} \qquad (84)$$

and for X's predicted from Y's we have

$$\eta^2_{xy} = 1 - \frac{\sigma^2_{ax}}{\sigma^2_x} \qquad (84a)$$

Are two $\eta$'s necessary? We have not proved herein that the
variance about the mean is smaller than about any other point,
but this fact is readily deducible from the computational formula
for $\sigma$ in terms of deviations from an arbitrary origin (AO). If AO
coincides with the mean, the mean of the squared deviations will
equal $\sigma^2$; if AO does not coincide
with the mean, a subtractive term will always be involved. It
follows that $\sigma_{ay}$ will be less than $\sigma_{y \cdot x}$ and that $\sigma_{ax}$ will be less than
$\sigma_{x \cdot y}$; hence both $\eta$'s will exceed r, but to varying degrees, depending
upon the extent to which the array means fail to fall on a
straight line. Since it is possible, and likely, that the means for
the vertical arrays will not exhibit the same departure from linearity
as those for the horizontal arrays, it is not reasonable to
expect the two $\eta$'s to agree.

The $\eta$'s indicate the relative accuracy with which one can predict
on the basis of array means, and accordingly they are useful 
measures of the extent of correlation when the regressions are 
curvilinear. The correlation ratio can also be utilized when the 
regression is linear; hence it is more generally applicable than the 
product moment coefficient, which is useful only in the special 
case where the assumption of linearity is tenable. The correlation 
ratio, however, does not enter into the regression equation con¬ 
stants. 

Even if the regressions were exactly linear for some defined 
population, a given sample would show deviations from linearity, 
and therefore $\eta$'s for successive samples would show chance sampling
deviations from r. By how much must $\eta$ exceed r before 
one suspects curvilinearity? The only adequate statistical test 
for answering this question involves the analysis of variance 
technique and hence is postponed to Chapter 13. 





Another definition of $\eta$ can be had by starting with the proposition
that the variance can be broken down into components, a
predictable and an unpredictable part, or $\sigma^2_y = \sigma^2_{my} + \sigma^2_{ay}$, in
which $\sigma^2_{my}$ is the variance of the array means weighted for the
number of cases in the several arrays. Then we have $\eta$ defined as
$\eta^2_{yx} = \sigma^2_{my}/\sigma^2_y$ and also as $\eta^2_{xy} = \sigma^2_{mx}/\sigma^2_x$. These are analogous
to $r^2 = \sigma^2_{y'}/\sigma^2_y$, and accordingly we may interpret
$\eta^2_{yx}$ as the proportion of Y variance explained by or associated
with variation in X.

Since the $\eta$'s are most readily computed by methods to be
developed in Chapter 13, no illustration will be given here.
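For the reader who wishes merely to confirm that the two definitions of $\eta^2$ agree, and not as a substitute for the computing routine of Chapter 13, a direct check on made-up data may be sketched. Each inner list holds the Y scores falling in one X array; all variances use N in the denominator; the function name is hypothetical.

```python
from statistics import fmean, pvariance

def correlation_ratio_squared(arrays):
    """Return eta squared (Y on X) by both definitions.

    arrays: one list of Y scores for each X array (column).
    """
    all_y = [v for arr in arrays for v in arr]
    n = len(all_y)
    total_var = pvariance(all_y)          # sigma squared of Y
    grand_mean = fmean(all_y)
    means = [fmean(arr) for arr in arrays]
    # Variance within arrays about the array means.
    within = sum((v - m) ** 2 for arr, m in zip(arrays, means) for v in arr) / n
    # Variance of the array means, weighted for the number of cases.
    between = sum(len(arr) * (m - grand_mean) ** 2
                  for arr, m in zip(arrays, means)) / n
    return 1.0 - within / total_var, between / total_var

# Made-up scores in which the array means rise and then fall.
arrays = [[2, 3, 4], [6, 7, 8, 7], [9, 10, 11], [7, 8, 6]]
eta_sq_a, eta_sq_b = correlation_ratio_squared(arrays)
print(round(eta_sq_a, 4), round(eta_sq_b, 4))
```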



CHAPTER 11 


Frequency Comparison: Chi Square 


The quantity chi square ($\chi^2$), defined in the last chapter as

$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (83)$$

or as the sum of the squared discrepancies, between observed and
expected frequencies, each divided by the expected frequency, is a
statistic which is very useful in a variety of problems involving
frequencies. Let us begin by an examination of what might be
expected to happen if a penny were tossed 100 times. The expected
frequency for heads is 50, and for tails is also 50. If for a
particular series of tosses we secured 55 heads and 45 tails, the
discrepancies would be +5 and -5. When these discrepancies
are squared, each becomes +25, and dividing each squared discrepancy
by the expected value we would have .5 + .5 = 1.0 as
the value for $\chi^2$. Had we obtained 40 heads and 60 tails, the
discrepancies of -10 and +10, when squared and divided by E,
would give 2 + 2 = 4 as $\chi^2$.
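The two penny-tossing results can be reproduced with a direct rendering of formula (83); the function name is only illustrative.

```python
def chi_square(observed, expected):
    """Sum of squared discrepancies, each divided by its expected frequency."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [50, 50]                       # heads, tails in 100 tosses
chi2_55_45 = chi_square([55, 45], expected)
chi2_40_60 = chi_square([40, 60], expected)
print(chi2_55_45, chi2_40_60)
```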

Three things are readily apparent from the above: first, the
greater the discrepancy relative to E, the greater the contribution
to $\chi^2$; second, the two parts being summed to obtain $\chi^2$ are not
independent (when the absolute discrepancy for heads is known,
that for tails can be inferred to be the same); and third, the squaring
process means that $\chi^2$ is always a positive quantity regardless of
the direction of the discrepancies. A fourth fact becomes apparent
if one recalls what happens when a series of tosses is repeated.
The number of heads (or tails) secured will vary from one series
of 100 tosses to the next; hence the amount of discrepancy will
vary, and therefore the magnitude of $\chi^2$ will vary from series to
series. In other words, successive sampling will yield varying
values for $\chi^2$. If we knew the sampling distribution for $\chi^2$, we
could specify the probability of securing by chance as large a value
as any obtained $\chi^2$, and thereby we could judge whether a given
amount of discrepancy is significantly large enough to warrant
the conclusion that the coin is biased.

Situations similar to this arise in research work. We may, on 
the basis of a hypothesis that a certain proportion of individuals 
possess a given characteristic, state how many of a sample of N 
cases would be expected to show the characteristic. Observations 
on N cases will provide an observed number. If the hypothesis 
is tenable, the discrepancy between observed and expected should 
be no larger than might arise on the basis of chance. If the ob¬ 
tained discrepancy is too large, i.e., not apt to arise by chance, 
the hypothesis becomes suspect. The student who recalls that 
the standard error of a proportion can be used in comparing ob¬ 
served with expected proportions may wonder whether another 
technique is necessary. The answer will be forthcoming. 

EMPIRICAL DEMONSTRATION 

Perhaps some insight regarding the sampling distribution of
$\chi^2$ can be obtained by a reconsideration of the empirical series
given on p. 49 for the number of times exactly 3 heads turned up
when 7 coins were tossed 100 times. On the basis of the proper
term in the binomial expansion, exactly 3 heads would be expected
.273 or 27.3 per cent of the time. Since N = 100, the expected
frequency for exactly 3 heads would be 27.3, and consequently
the expected frequency for other than exactly 3 heads, or all other
possibilities, would be N - 27.3, or 72.7. Table 26 shows that
for 93 successive samples the actual frequency for exactly 3 heads
varied from 15 to 35; i.e., 1 sample yielded 3 heads 15 times, 2
samples yielded 3 heads 17 times, etc. Since exactly 3 heads
turned up 15 times in 1 sample, it follows that other possibilities
turned up 85 times; hence the 85 in the second column opposite
15, 83 opposite 17, etc. The f column indicates that 2 samples
yielded 3 heads 35 times, that 3 samples gave 3 heads 34 times,
etc. The fourth column gives the discrepancy between the observed
and expected number of times that exactly 3 heads turned
up, and the fifth column gives the discrepancy between observed
and expected number of times for "not 3 heads." Note that these
2 columns differ only as to direction; knowing one, the other can
be written down at once.

Table 26. Empirical Illustration of the Sampling Variation of $\chi^2$
Based on Number of Times 3 Heads Turned up When Each of 93
Students Tossed 7 Coins 100 Times

  (1)        (2)       (3)      (4)        (5)        (6)
 Times      Times               O - E      O - E      $\chi^2$ =
 Exactly    Not 3               for 1st    for 2nd    $\sum (O - E)^2/E$
 3 Heads    Heads       f       Col.       Col.

   35         65        2        7.7       -7.7       2.99
   34         66        3        6.7       -6.7       2.26
   33         67        2        5.7       -5.7       1.64
   32         68        6        4.7       -4.7       1.11
   31         69        4        3.7       -3.7        .69
   30         70        3        2.7       -2.7        .37
   29         71        6        1.7       -1.7        .15
   28         72        8         .7        -.7        .02
   27         73        8        -.3         .3        .00
   26         74       10       -1.3        1.3        .09
   25         75       12       -2.3        2.3        .27
   24         76        7       -3.3        3.3        .55
   23         77        4       -4.3        4.3        .93
   22         78        9       -5.3        5.3       1.42
   21         79        3       -6.3        6.3       2.00
   20         80        3       -7.3        7.3       2.68
   19         81
   18         82
   17         83        2      -10.3       10.3       5.35
   16         84
   15         85        1      -12.3       12.3       7.62

                       93


The sixth column gives the values of χ² for the several discrepancies.
The computation of these values is accomplished by
substituting directly into formula (83). Thus for the sample
which yielded exactly 3 heads 15 times (and other possibilities,
85) we have

$$\chi^2 = \frac{(15 - 27.3)^2}{27.3} + \frac{(85 - 72.7)^2}{72.7} = 7.62$$
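
The computation for the sixth column can be sketched in a few lines of Python. This is only an illustration: the function name and its arguments are our own labels, and it assumes that the two-category χ² is the sum of the two (O − E)²/E terms, as in the worked figure above.

```python
def chi_square_two_categories(observed, n_trials, p_expected):
    """Chi square for one category vs. all other possibilities."""
    e_hit = n_trials * p_expected      # expected frequency, e.g. 27.3
    e_miss = n_trials - e_hit          # expected frequency for "all others", e.g. 72.7
    o_miss = n_trials - observed       # observed frequency for "all others"
    return (observed - e_hit) ** 2 / e_hit + (o_miss - e_miss) ** 2 / e_miss


# A few of the observed counts of "exactly 3 heads" listed in Table 26.
for observed in (35, 27, 17, 15):
    print(observed, round(chi_square_two_categories(observed, 100, 0.273), 2))
```

Rounded to two decimals, the values for 35, 17, and 15 agree with the 2.99, 5.35, and 7.62 of the table.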


The third column may now be interpreted as giving the frequencies
for given values of χ² yielded by the 93 samples. Note
that (a) all the chi squares are positive, (b) small values for χ²
occur more frequently than large values, and (c) for this particular
setup involving 100 tosses, it is impossible to secure any values
between .00 and 2.99 other than those given in the sixth column.
This last observation may seem puzzling at first, but, with the
expected values fixed at 27.3 and 72.7 and the observed values
restricted to integers, it is readily seen that no values between,
for example, 2.68 and 2.99 can occur. This point may be made
clearer by noting the possible values for χ² in the simpler situation
where a coin is tossed 10 times. The expected number of heads is,
of course, 5, and the possible discrepancies or (O − E) values are
(0 − 5), (1 − 5), (2 − 5), (3 − 5), ..., (10 − 5). Squaring each,
dividing by E (= 5), and doubling, since in this case (O − E)²/E
is the same for tails, we get the following values for χ²: 10, 6.4,
3.6, 1.6, .4, 0, .4, 1.6, 3.6, 6.4, and 10. No other values are possible
because the possible number of heads is a discrete series; hence
the χ² values are also a discrete series. This lack of continuity
imposes a restriction on the use of χ² which will receive more
attention as we proceed.
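
The discrete series for the 10-toss case can be enumerated directly. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes a fair coin so that the expected number of heads is half the number of tosses.

```python
def chi_square_values_for_coin(n_tosses):
    """All possible chi squares (heads vs. tails) when a fair coin is tossed n_tosses times."""
    expected = n_tosses / 2
    # The tails discrepancy has the same absolute size, so the heads term is doubled.
    return [2 * (heads - expected) ** 2 / expected for heads in range(n_tosses + 1)]


values = chi_square_values_for_coin(10)
print(values)                 # one value for each possible number of heads, 0 through 10
print(sorted(set(values)))    # the distinct values that can ever occur
```

Only six distinct values can occur, which is the discontinuity referred to in the text.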

The sampling behavior of χ² can be better depicted by assembling
the 93 values of Table 26 into an ordinary frequency distribution
with intervals running in one direction from zero. This
has been done in Table 27, which also includes the 93 χ² values
derived from Table 8 for exactly 6 heads vs. other possibilities
when 7 coins were tossed 100 times. Two important facts are
deducible from the empirical distributions of Table 27. First, the
sampling distribution of χ² is markedly skewed for this type of
situation, involving two observed frequencies (3 heads vs. not
3 heads, or 6 heads vs. not 6 heads) with a priori determination of
expected frequencies (by the binomial expansion), and second,
the discontinuity or presence of gaps in the distribution is more
marked for the 6 heads than for the 3 heads χ²'s. This greater
discontinuity is due to the smaller expected frequency, 5.47, for
6 heads, as compared to 27.3 for 3 heads. The student can easily
demonstrate to himself that the gap between the χ² values is
greater for small expected frequencies.

Table 27. Frequency Distributions for χ² from Successive Samplings

      χ²          Exactly    Exactly
                  3 Heads    6 Heads

 10.50-10.99                    1
 10.00-10.49
  9.50- 9.99
  9.00- 9.49
  8.50- 8.99
  8.00- 8.49                    1
  7.50- 7.99         1
  7.00- 7.49
  6.50- 6.99
  6.00- 6.49
  5.50- 5.99                    1
  5.00- 5.49         2
  4.50- 4.99
  4.00- 4.49
  3.50- 3.99                    5
  3.00- 3.49
  2.50- 2.99         5
  2.00- 2.49         6          5
  1.50- 1.99         2
  1.00- 1.49        15         18
   .50-  .99        15
   .00-  .49        47         62
                    --         --
                    93         93

Before leaving Table 26, it might be well to point out a connection
between χ² and x/σ. Consider the sample for which exactly
3 heads turned up 15 times. If we express the frequency as a proportion,
we have .15, and the universe proportion is p = .273.
It will be recalled that the standard error of a proportion is
√(pq)/√N; in this case the standard error is √((.273)(.727))/√100
or .0446. Now, when we take the deviation of the observed proportion
from the universe value, .15 − .273, and divide by the
standard error, σ_p, we have (.15 − .273)/.0446 = −2.76 as an
x/σ. The square of 2.76 is 7.62. It is no accident that this is the
value of χ² as given in Table 26 (bottom figure of the sixth column).
The other values in the sixth column are likewise the squares of
the corresponding deviations divided by σ_p. This would suggest
that the sampling distribution of χ² might be the same as that of
(x/σ)², or the distribution which would result if the base line
distances of the unit normal curve were squared. Since all such
squared values would be positive, the distribution would begin at
zero with the greatest concentration of frequencies near zero, not
unlike the empirical distribution of Table 27. This, however, is an
exceptional case. The student should not jump to the conclusion
that every χ² is merely an (x/σ)².
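
The agreement between the squared x/σ and χ² for this sample can be verified numerically. The sketch below is an illustration only; the variable names are our own, and the figures are those quoted for the sample with 15 occurrences of exactly 3 heads.

```python
import math

n, p_universe, observed = 100, 0.273, 15

# x/sigma route: deviation of the observed proportion over its standard error.
p_sample = observed / n
standard_error = math.sqrt(p_universe * (1 - p_universe) / n)
z = (p_sample - p_universe) / standard_error

# Chi-square route: two (O - E)^2 / E terms, "3 heads" and "not 3 heads".
expected_hit = n * p_universe
expected_miss = n - expected_hit
chi_square = ((observed - expected_hit) ** 2 / expected_hit
              + ((n - observed) - expected_miss) ** 2 / expected_miss)

print(round(standard_error, 4), round(z, 2), round(z ** 2, 2), round(chi_square, 2))
```

The last two printed figures coincide, as the text asserts for this two-category case.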

Suppose we compute χ² for the discrepancies of observed and
expected frequencies shown in Table 28, wherein will be found the
distribution of frequencies obtained by tossing 7 coins 1000 times.
The expected frequencies are from the binomial expansion. Note
that both the E column and the O column sum to 1000 or N, and
that the (O − E)'s sum to zero. The several contributions to
χ² are given in the last column, which sums to 7.65, which is χ²
for the entire table. Two other series of 1000 tosses made by
students yielded chi squares of 12.52 and 15.02. These 3 values for
χ² tend to be large when compared with the values in Table 27. One
might presume that this is due to the larger N's, 1000 compared to
100. That N may not be the factor is suggested when one calculates
χ² for exactly 3 heads vs. other than 3 heads for the data of
Table 28. We have (267 − 273)²/273 + (733 − 727)²/727 = .18
as the value of χ². Note also that for exactly 6 heads vs. other
than 6 heads, χ² will be zero for this sample of 1000 tosses. These
two chi squares, based on 1000 cases, are for exactly the same
situation as the chi squares of Table 26 based on 100 cases, but
they are no larger.

Table 28. χ² for Discrepancies of Expected and Observed Frequencies
When 7 Coins Were Tossed 1000 Times

 No. of
 Heads        E         O       O − E     (O − E)²/E

   7          8         4        −4          2.00
   6         55        55         0           .00
   5        164       157        −7           .30
   4        273       283        10           .37
   3        273       267        −6           .13
   2        164       177        13          1.03
   1         55        45       −10          1.82
   0          8        12         4          2.00

 Sums      1000      1000         0          7.65
            (N)       (N)                    (χ²)
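
The arithmetic of Table 28 can be reproduced with a short Python sketch. It is offered only as a check; the variable names are our own, and the expected frequencies are the rounded values printed in the table rather than exact binomial terms.

```python
# Rows of Table 28, from 7 heads down to 0 heads.
expected = [8, 55, 164, 273, 273, 164, 55, 8]
observed = [4, 55, 157, 283, 267, 177, 45, 12]

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

# Collapsing to "exactly 3 heads" vs. "other than 3 heads" leaves two terms.
chi_square_3_heads = (267 - 273) ** 2 / 273 + (733 - 727) ** 2 / 727

print([round(c, 2) for c in contributions])
print(round(chi_square, 2), round(chi_square_3_heads, 2))
```

The eight-category total is near 7.65, whereas the two-category comparison from the same 1000 tosses gives only about .18.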


The real reason for the larger χ² is the fact that more discrepancies
are involved, eight (O − E)²/E terms vs. two such terms;
i.e., the magnitude of χ² is partly a function of the number of
categories or possibilities for discrepancy. This seems plausible
enough when one considers that, in general, the larger the number
of (O − E)²/E terms, the larger their sum. But even this generalization
is subject to qualifications. The chance sampling distribution
of χ² is a direct function of the number of independent discrepancies
or what has been called the degrees of freedom, a concept
which must be understood before the student can correctly use χ².


DEGREES OF FREEDOM 

We have already seen that, for the situation involving exactly
3 heads vs. other than 3 heads, the absolute magnitude of (O − E)
was the same for both categories. This means that the two discrepancies
are not independent; as soon as one is calculated, the other
can be written down at once without any further calculation;
hence 1 degree of freedom exists. If we study the data of Table
28, we see that, since the discrepancies must sum to zero, all 8
cannot be independent or vary freely. As soon as 7 are known,
the eighth is determined. This means that there are 7 degrees of
freedom for this situation. If we were to roll a die 600 times and
then compare the observed frequency for 6 spots, 5 spots, etc.,
with the number expected on the basis of a perfectly homogeneous
(unloaded) cube, we would have 5 possible independent discrepancies,
or 5 degrees of freedom. In each of these situations the
expected frequencies are determinable on the basis of some a priori
principle, and the only restriction is that the total expected frequency
must be the same as the total observed frequency, i.e.,
N_e must equal N_o. In all such cases the number of degrees of
freedom (df) is one less than the number of categories.

The df for other situations in which the χ² technique is applicable
will follow the same principles as to the number of independent
discrepancies, but not the rule just laid down. Suppose we consider
a two by two or fourfold table such as that given in Table 29
(which contains fictitious data for purpose of ease in exposition).
The expected frequencies are set up on the assumption that there
is no difference between the 2 groups (the null hypothesis). If
this were the case, we would expect that the 180 yeses would be
distributed in the 1 to 2 ratio of the right-hand totals; likewise
the 120 noes. Note that the expected frequencies reading across,
i.e., 40 and 60, and 80 and 120, are proportional to the marginal
totals at the bottom. In determining the df, we can observe either
of two things: first, that all 4 discrepancies have the same absolute
value, so that when 1 is known the other 3 can be written down
at once; or second, that in setting up the expected frequencies, we
are restricted by the requirement that the 2 top-row values must
sum to N₁, the next 2 must sum across to N₂, the left-hand column
must sum to Nn, and the next column to Ny; as soon as the value
40 has been ascertained, the remaining 3 expected values become
fixed. Either way we look at the situation, we see that there is
but 1 degree of freedom even though there are 4 cells or 4 discrepancies.

Table 29. χ² and Fourfold Table
(Expected frequencies in parentheses)

               No           Yes          Totals
 Group 1     50 (40)      50 ( 60)      100 = N₁
 Group 2     70 (80)     130 (120)      200 = N₂
 Totals       120          180          300 = N
              Nn           Ny
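
The expected frequencies in parentheses can be derived from the marginal totals alone. The Python sketch below is illustrative; the variable names are our own, and it assumes, as in the text, that each expected frequency is the row total times the column total divided by N.

```python
# Observed frequencies of Table 29: rows are Group 1 and Group 2, columns are No and Yes.
observed = [[50, 50],
            [70, 130]]

row_totals = [sum(row) for row in observed]          # N1, N2
col_totals = [sum(col) for col in zip(*observed)]    # Nn, Ny
n = sum(row_totals)

# Expected frequency under the null hypothesis: (row total)(column total) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]
discrepancies = [[o - e for o, e in zip(o_row, e_row)]
                 for o_row, e_row in zip(observed, expected)]

print(expected)
print(discrepancies)
```

All four discrepancies have the same absolute value, so that only one of them is free to vary.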



The fundamental question is: How many of the discrepancies
are independent? In practice this can be answered by determining
how many categories or cells can be filled in at will before the
others become fixed because of the restrictions imposed. If we
turn back to Table 25 of the last chapter (p. 180), we see that the
restrictions for a 3 by 3 table are similar to those for a 2 by 2
table: the expected frequencies must add across and down to the
observed marginal totals. The student should ponder Table 25
long enough to see that the proper df is 4. The general rule of
thumb for ascertaining the degrees of freedom for all contingency-type
tables of k rows and l columns, where the marginal
totals are utilized in setting up the expected frequencies, is to
take df = (k − 1)(l − 1). Thus for the fourfold table we have
(2 − 1)(2 − 1) = 1, and for the 3 by 3 table, (3 − 1)(3 − 1) = 4,
etc. Such tables need not be square; in fact, very often the psychologist
wishes to compare two groups on the basis of k possible
responses to a question. For this k by 2 table, the df becomes
(k − 1)(2 − 1), or simply k − 1.

SAMPLING DISTRIBUTION OF χ²

Before discussing further the applications of χ², we turn again
to the sampling distribution of this statistic. It is easy enough to
see from the coin-tossing situations which we have considered
above that chance leads to discrepancies between observed and
expected frequencies. In those situations wherein we wish to
compare groups, we know from the discussion of sampling in
Chapter 5 that differences in responses or characteristics can and
will arise as a result of chance sampling even though the two universes
do not differ. Likewise, contingency tables involving the
possible relationship between two categorized variables will yield
varying chance values of χ² even though no real association exists.
Knowing the chance sampling distribution of χ² for various degrees
of freedom, we can specify the probability of obtaining a χ² as
large as any value and conclude therefrom, according to the
situation, that observations do not agree with hypothesized frequencies
or that two or more groups differ significantly or that a
real association exists.

We have already suggested that, for 1 degree of freedom, the
distribution of χ² is the same as for (x/σ)². The general equation
for the χ² distribution * involves an n or the df, and therefore
there is no one χ² distribution but a very large number of distributions,
one for each value of n. It happens that practical work
seldom involves more than 30 degrees of freedom, so that we need
not concern ourselves with all possible distributions. Curves for
the distribution of χ² can be drawn for various n's with χ² along
the abscissa and the ordinates as the y values obtained by the
equation in the footnote. The area under each curve will be one
unit, as in the unit normal curve. Figure 16 contains curves for
7 different values of n or df, so drawn as to be comparable. Note
that the shapes of these curves and their general locations along
the abscissa vary with n.

* The equation is

$$y = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,(\chi^2)^{(n-2)/2}\,e^{-\chi^2/2}$$

in which Γ indicates the gamma function as defined in texts in advanced
calculus.

Fig. 16. Chi square distributions for various degrees of freedom. Values of
χ² along abscissa.

For n = 1, or for 1 degree of freedom, the curve starts very
high (strictly speaking, it is asymptotic to the ordinate and hence
starts at infinity) and drops quite rapidly. For this curve the
height or y value at χ² = .16 is .92 (not shown). At χ² = .01,
the height is more than four times greater than .92. By the time
we reach a χ² of 1.00, the height is .242 (what x/σ value does this
height correspond to when the unit normal curve is considered?).
Then the curve trails off until, at χ² = 6.25, the height is about
.007. Regardless of n, the right-hand parts of the curves never
reach the base line; i.e., they are asymptotic. If we think of the
total area under any curve as unity, then the area between ordinates
erected at any two base line points, or the area beyond any
point, can be expressed as a proportion of the total. Thus, for
n = 1, .99 of the area is beyond (to the right of) a χ² value of
.000157, and only .05 is beyond 3.841. Stated differently, the
probability of obtaining a χ² value as large as 3.841 is .05; for χ²
as large as 6.635, P = .01; and the P = .001 point is at a χ² of
10.827. These hold only for df = 1.

The curve for n = 2 starts at a height of .50 and then descends,
but less rapidly than that for n = 1. It is readily seen that large
values for χ² occur more frequently when n = 2 than when n = 1.
The P = .05 point is at 5.991; i.e., the probability of obtaining by
chance a χ² value as great as 5.991 is .05. The .01 point is at 9.210,
and the .001 point is at 13.815.
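
The tabled points quoted for n = 1 and n = 2 can be checked with the Python standard library. The sketch below is illustrative; the function names are our own, and it relies on two standard closed forms that are not derived in the text: for df = 1 the tail area equals the two-tailed area of the unit normal curve beyond the square root of χ², and for df = 2 it equals exp(−χ²/2).

```python
import math


def chi_square_tail_df1(chi_square):
    """Probability of a chi square this large or larger when df = 1."""
    return math.erfc(math.sqrt(chi_square / 2))


def chi_square_tail_df2(chi_square):
    """Probability of a chi square this large or larger when df = 2."""
    return math.exp(-chi_square / 2)


for value in (3.841, 6.635, 10.827):
    print(1, value, round(chi_square_tail_df1(value), 4))
for value in (5.991, 9.210, 13.815):
    print(2, value, round(chi_square_tail_df2(value), 4))
```

Each printed probability should lie close to the .05, .01, and .001 levels named in the text.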

For n = 3, the distribution curve begins at zero height, rises
sharply to a maximum (modal value) at χ² = 1, and then falls off
so that the P = .01 point is at χ² = 11.341. As n is taken larger
and larger, the distributions become less and less skewed and move
farther and farther to the right. The mean of a given distribution
always corresponds to a χ² equal to n, and except for n = 1 the
modal value is at a χ² of n − 2.

The distributions of χ² for varying n's are theoretical probability
distributions. They may be interpreted as random sampling distributions,
and by them one can judge the statistical significance
of discrepancies. Their use is exactly analogous to testing the
significance of the difference between means, which it will be recalled
involves setting up the null hypothesis: if there is no real
difference between two universe means, the D/σ_D values for successive
samples will form a normal curve with center at zero and
with unit variance. If a found difference is 1.96 times its standard
error, the null hypothesis becomes suspect; if 2.58 times its standard
error, the hypothesis of no difference can fairly safely be rejected;
if D/σ_D = 3.00, rejection is more definitely indicated.
These three critical ratios, it will be recalled, correspond to the
.05, the .01, and the .003 levels of significance.

Now χ² can likewise be used to test the null hypothesis. The
essential difference between the D/σ_D and the χ² techniques is
that the latter involves skewed probability distributions; but,
knowing the distribution for a given n, one can ascertain the
necessary value of χ² for the .05, the .01, the .001, or other levels
of significance. The statement of the null hypothesis in connection
with χ² may vary slightly according to the given situation.
If the frequencies in the universe agree with the a priori expected
frequencies, if the frequencies in two or more universes are the
same, if there is zero association in the universe between two classifications
or variables; if any such conditions hold for the universe
or universes, then successive samplings will yield χ² values which
will distribute themselves in a determinable manner, thus permitting
one to specify the probability of obtaining by chance a
χ² value as large as any given or obtained value. When this
probability is small, say .01 or less, the null hypothesis is rejected,
and its rejection implies that there are real discrepancies or that
real differences exist or that there is a real association.

Since the random sampling distribution of χ² depends upon the
df, which varies from situation to situation, it is not feasible to
give a rule-of-thumb criterion in terms of the magnitude of χ²
which would be deemed significant. If we adopt P = .01 as the
level of significance we wish to attain, then we need to refer to
available tables of χ² in order to find how large χ² must be to
correspond to this level; likewise for any other chosen level of
significance. Probability tables for χ² are available in two forms.
One form, Fisher’s (see Table D of the Appendix), gives the values
of χ² which will be exceeded by chance a specified proportion of
the time, such as .10, .05, .01, and .001. Elderton’s table, which may be
found in Pearson’s Tables for statisticians and biometricians, gives
the probabilities for obtaining chi squares as large as specified
values expressed as integers, such as 1, 2, 3, ..., 21, 22. Both
tables include varying degrees of freedom. Because of an early
erroneous notion as to the meaning of degrees of freedom, Elderton’s
table must be entered with df equal to 1 less than his n'
values, that is, e.g., use n' = 4 when n or df = 3. Elderton’s
table has one advantage over that given in our Appendix: P values
as small as .000001 can be ascertained.

For n’s larger than 30, the expression √(2χ²) − √(2n − 1) will
have a sampling distribution which will follow very closely the
unit normal curve. The probability is accordingly .05 that this
expression will exceed +1.64, and .01 that it will exceed +2.33,
by chance.
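
The large-n expression may be sketched as a small Python function. The function name and the numerical example (a χ² of 60 with 40 degrees of freedom) are hypothetical and serve only to show how the expression is referred to the 1.64 and 2.33 points.

```python
import math


def normal_deviate_for_large_df(chi_square, df):
    """sqrt(2 * chi square) - sqrt(2n - 1), referred to the unit normal curve for n > 30."""
    return math.sqrt(2 * chi_square) - math.sqrt(2 * df - 1)


# Hypothetical illustration: chi square of 60 based on 40 degrees of freedom.
deviate = normal_deviate_for_large_df(60, 40)
print(round(deviate, 2), deviate > 1.64, deviate > 2.33)
```

In this invented case the deviate falls between 1.64 and 2.33, i.e., beyond the .05 point but short of the .01 point.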

Before the possible applications of χ² are summarized, a word
should be said about the underlying assumptions which restrict
its usage. The probability figures in the tables of χ² are based on
continuous distributions, whereas the chi squares calculated in
practice form a discrete series. This fact has been pointed out
earlier in this chapter. A second assumption is that the sampling
distribution of the observed frequencies about a given E follows
the normal curve. One can seldom, if ever, check on the tenability
of this assumption, but it is possible to specify conditions where
the assumption will not hold. If any one E is small, it is not
possible to have a normal distribution of O’s about it even though
the total N is large. For instance, if E = 2, the O’s are restricted
on one side of E to zero and 1, whereas on the other side the possible
values run 3, 4, 5, and upward. Such a curtailment ordinarily
leads to a skewed distribution for the observed frequencies.
Now it is obvious that, when E is small, we have a greater amount
of discontinuity; hence the sampling distribution of observed frequencies
will be discrete instead of continuous as called for by the
normal curve. It would seem, therefore, that small expected frequencies
lead to a violation of both the fundamental assumptions
underlying the use of χ². Various criteria have been proposed for
the required size of E. Some say that the χ² technique is inapplicable
when any one E is less than 10; others say that an E may be
as small as 5. We would suggest that, when possible, adjacent
categories be combined so as to have no E less than 10; if such a
combination is impossible and an E is less than 10 but greater
than 6, χ² may be used, providing one is cautious as to the conclusions
drawn therefrom. A correction for discontinuity when
df is 1, as in a fourfold table, is available and will be given later.

APPLICATIONS 

The chief situations for which it is permissible to use χ² may be
classified into three types.

1. The discrepancy of observed frequencies from frequencies
expected on the basis of some a priori principle. Such situations
are most frequently found in genetics, wherein it is hypothesized
that certain crossings should lead to the presence, in a certain
proportion of offspring, of some defined characteristic or variation
thereof. The frequency table for such situations is 1 by k, with
k − 1 degrees of freedom, since the only restriction is that the
expected frequencies must sum to N. This type of situation does
not arise often in research in the social sciences.

2. Contingency tables. Here we have two types of situations 
which differ only in the methods of classifying. 

a. We may have a contingency table which is analogous to a correlation
table in that both classifications are based on continuous
or ordered discrete variables for which we have only categorized
information for N individuals. The two variables might be in
dichotomy (fourfold table), or one might be a dichotomy and the
other manifold, or both might involve multiple categories. For
these contingency tables it is meaningful to speak of the correlation
between the two variables, and the degree of correlation might
be appropriately specified by the tetrachoric r or the fourfold
point r or the contingency coefficient (corrected or uncorrected);
which measure is used depends upon meeting the requisite assumptions.
In so far as we are concerned only with χ², we have the
means for testing the significance of the correlation or association
as a chance departure from zero or no relationship, and the significance
test can be used without knowledge of the degree of correlation.
Such a test of significance is sometimes spoken of as a test
of independence: are the two classifications independent? If so,
χ² should be no larger than would arise by chance. If we have
evidence for correlation or a lack of independence from the χ²
technique, we can proceed to calculate an appropriate coefficient
for measuring the degree of correlation or the strength of association.
The student should, as an exercise, convince himself that
χ² per se is not a measure of association.

b. The other contingency-type situation involves classification
into categories for one variable vs. classification into unordered
groups for the other, or one unordered grouping vs. another. The
fundamental problem is apt to be that of comparing two or more
groups with regard to multiple responses; i.e., we want a test of
the significance between groups rather than a measure of correlation,
which would not be entirely meaningful except in the loose
sense that a particular response is associated more often with a
particular group. As previously stated, the df for a k by l contingency
table is (k − 1)(l − 1).

3. Goodness of fit. If we wish to check on whether it is reasonable
to believe that a given frequency distribution is, within the
limits of chance sampling, of the normal or some other specified
type, a frequency curve having the same basic constants (e.g.,
N, M, and σ for the normal curve) as those computed from the
observed frequency distribution can be fitted to the data. If a
normal curve is being fitted, the table of normal curve functions
is used to set up the theoretical or expected frequencies for the
several grouping intervals. Then χ² can be computed in the usual
manner. The df will correspond to the number (k) of grouping
intervals less the number of constants derived from the data and
used in the fitting process. For the normal curve the observed
and theoretical distributions are made to agree as to N, M, and
σ; hence df = k − 3. An attempt will be made later to explain
the reasoning back of the determination of df when checking the
goodness of fit of frequency curves.

Fourfold contingency tables. For illustrative purposes, let
us first apply χ² to a couple of 2 by 2 contingency tables for which
the tetrachoric r, as well as the contingency coefficient, is an
appropriate measure of the degree of correlation. Before we do
this, it might be well to recall that χ² for a fourfold table can be
computed by a simple formula which does not require calculation
of the four expected frequencies. Let the fourfold frequencies and
marginal totals be set up as in Table 30.

Table 30. Setup for Computing χ² from a Fourfold Table by Means
of a Formula

      A         B        A + B
      C         D        C + D
    A + C     B + D        N

Chi square can be computed from

$$\chi^2 = \frac{N(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} \tag{85}$$


This is simpler than calculation from the discrepancies between
observed and expected frequencies. The requisite that no expected
frequency shall be less than 5 still holds. A quick check on this
can be obtained by multiplying the smaller right-hand marginal
frequency by the smaller frequency on the bottom margin and
dividing the product by N. This will yield the smallest expected
frequency. In Table 31 will be found two fourfold tables for
Stanford-Binet items. Direct substitution into formula (85)
yields the two chi squares at the bottom of the table. The P
values are approximately .01 and less than .001, respectively. We
can be reasonably sure that there is some correlation between the
first two items, and fairly certain that items 3 and 4 are correlated.
The value of the tetrachoric r is .40 for each table, and the contingency
coefficient (with no corrections) is .24 for each table.
Thus we see that the χ² values associated with the same degree of
correlation can be different. Why? Would it be possible for two
fourfold tables to yield the same P, yet differ in the degree of
relationship?

Table 31. χ² Applied to Contingency (Fourfold) Tables
(cell frequencies not legible in this copy; surviving totals and results only)

                      Items 1 and 2      Items 3 and 4
 Marginal totals         51, 49             128, 72
 N                        100                 200
 χ²                       5.93               12.40
 P                      about .01        less than .001
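
Formula (85) and the quick check on the smallest expected frequency can be sketched in Python. The function names are our own, and since the cell frequencies of Table 31 are not available here, the illustration uses the fictitious frequencies of Table 29 instead.

```python
def chi_square_fourfold(a, b, c, d):
    """Formula (85): chi square from the four cell frequencies, no expected values needed."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def smallest_expected_frequency(a, b, c, d):
    """Smaller right-hand marginal times smaller bottom marginal, divided by N."""
    n = a + b + c + d
    return min(a + b, c + d) * min(a + c, b + d) / n


# Table 29: A = 50, B = 50, C = 70, D = 130.
print(chi_square_fourfold(50, 50, 70, 130))
print(smallest_expected_frequency(50, 50, 70, 130))
```

For these frequencies the smallest expected value is 40, well above the minimum of 5, and the χ² agrees with what the four (O − E)²/E terms of Table 29 would give.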

Another application of χ² to fourfold tables is given in Table 32,
in which the sexes at 4 age levels are compared in performance
on a Stanford-Binet item. None of the χ² values reaches 6.635,
the value corresponding to the .01 level of significance, but 3 of
them are large enough to suggest a real sex difference. That a
real difference may exist is also suggested by the fact that the boys
are consistently superior at all 4 age levels. This brings us to an
important property of χ². The several chi squares for independent
(i.e., based on different samples) tables may be summed to a total
χ², with df equal to the sum of the df’s for the chi squares being
summed. Thus for Table 32 we have 4.30 + 5.89 + .43 + 5.02 =
15.64 as a χ² based on 4 degrees of freedom, by which we can judge
the significance of the over-all sex differences shown in the 4 tables.
With χ² = 15.64 and n = 4, we find (from Table D) that P is less
than .01 (for n = 4, a χ² of 13.28 corresponds to the .01 level). If
one turns to Elderton’s tables, it can be ascertained that P is
about .004. In other words, as great a sex difference, considering
all 4 age groups, would arise 4 times in 1000 by chance; hence it
would be concluded that a real difference does exist for this item.

Table 32. χ² Used to Test Sex Differences in Passing (+) or Failing
(−) a Binet Item
(cell frequencies not legible in this copy)

 Age        6        7        8        9
 χ²       4.30     5.89      .43     5.02
 P        <.05     <.02     <.50     <.05
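
The summation of the four chi squares of Table 32 can be checked with the Python sketch below. The variable names are our own, and the probability uses a standard closed form for 4 degrees of freedom, P = exp(−χ²/2)(1 + χ²/2), which is an outside result and not taken from the text.

```python
import math

# Chi squares for ages 6, 7, 8, and 9, each from an independent fourfold table (df = 1).
chi_squares = [4.30, 5.89, 0.43, 5.02]

total_chi_square = sum(chi_squares)
total_df = len(chi_squares)          # the df's add: 1 + 1 + 1 + 1

# Tail probability for df = 4 only (closed form assumed, see lead-in).
p_value = math.exp(-total_chi_square / 2) * (1 + total_chi_square / 2)

print(round(total_chi_square, 2), total_df, round(p_value, 4))
```

The probability comes out a little under .004, in line with the figure read from Elderton's tables.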

This combinatorial property of χ² is important for all situations
where frequency data from different groups cannot first be legitimately
combined because of age or other differences. It is most
useful when consistency is present among several comparisons,
none of which taken singly possesses statistical significance. However,
neither consistency nor insignificance for single comparisons
constitutes a requisite for using the sum of chi squares as an over-all
test of significance or as a means of arriving at one summary
probability figure.

The single age comparisons in the above example could, of course, 
be made by means of percentages or proportions. This could be 
done by either formula (27) or formula (27a) of Chapter 5, the 
discussion of which (pp. 75-77) should be reviewed at this time. 
Formula (27a), which bases the standard errors of the two propor¬ 
tions on the proportion of the combined group showing the charac¬ 
teristic, is to be preferred t6 formula (27). Let us examine the 
connection between the x^ technique and the D/gd for proportions 
method of testing the significance of the difference between two 
groups, the individuals of which have been classified as either 
passing or failing, saying either yes or no, possessing or not possess¬ 
ing a characteristic, etc. All such comparisons begin with a four¬ 
fold frequency table of the type symbolized in Table 30, or an 
equivalent (the frequencies may have been recorded for only one 



Fourfold Contingency Tables 


203 


category of the dichotomy, say the yeses, from which the fre¬ 
quencies for the other category may be readily inferred by sub¬ 
traction). Table 33 contains the basic table of frequencies for 
the presence (+) or absence (—) of a characteristic for groups 1 
and 2, and the basic table of proportions obtained by dividing the 
frequencies by the proper iV^s is indicated. Note that the p and 
q values on the bottom margin are the proportions to use in for¬ 
mula (27a) for the standard error of the difference between pi 

ToWe 3S, Schema for Comparing Groups via and via Difference 
BETWEEN Proportions (or Percentages) 

Frequencies 

+_ - 

A B A + B - ATi 

C D C -\-D N2 

A+C B+D N 

Proportions 

+ _ - 

1 PI = A/Ni qi = B/Ni 

Group- 

2 P2 ^'C/N2 Q2 ~ D/A^2 

p^{A+C)/N q^{B-^D)/N p+9*1.0 

and p 2 * Note also that pi = A/Ni = A/{A + B) and that 
= C/N2 = C/(C + D). 

In order to avoid carrying along a square root sign or radical, 
and for another reason which if not now obvious will soon become 
so, let us write the square of the expression for the critical ratio of 
the difference between the two proportions, pi and p 2 > thus, 

^ (Pi - Pz)^ 

B.,vg 

Ni Ni 

When we replace all the proportions by their equivalents involving
frequencies and the proper N's and also substitute frequencies for
N1 and N2, we have

\[
\begin{aligned}
\frac{D^2}{\sigma_D^2}
&= \frac{\left[\dfrac{A}{A+B} - \dfrac{C}{C+D}\right]^2}
        {\dfrac{[(A+C)/N]\,[(B+D)/N]}{A+B} + \dfrac{[(A+C)/N]\,[(B+D)/N]}{C+D}} \\[1ex]
&= \frac{\dfrac{(AC + AD - AC - BC)^2}{[(A+B)(C+D)]^2}}
        {\dfrac{(A+C)(B+D)(C+D) + (A+C)(B+D)(A+B)}{N^2 (A+B)(C+D)}} \\[1ex]
&= \frac{(AD - BC)^2 N^2}{(A+B)(C+D)\,[(A+C)(B+D)(C+D) + (A+C)(B+D)(A+B)]} \\[1ex]
&= \frac{(AD - BC)^2 N^2}{(A+B)(C+D)(A+C)(B+D)(A+B+C+D)} \\[1ex]
\frac{D^2}{\sigma_D^2}
&= \frac{(AD - BC)^2 N}{(A+B)(C+D)(A+C)(B+D)}
\end{aligned}
\]

which equals χ² as given by formula (85) for the fourfold table.
This confirms a fact already mentioned, that χ² for one degree of
freedom is the same as the square of the critical ratio, but note
that this exact equivalence holds only when formula (27a) is used
to calculate the standard error of the difference between
proportions. Since formula (27a) is applicable only for comparing
proportions based on independent samples, it follows that χ² is
similarly restricted. That is, χ² as computed from a fourfold table by
(85) does not allow for any correlational factor which might be
introduced because the two groups consist of paired or matched
individuals or for the correlational factor which would be present
if p1 and p2 (or the corresponding frequencies) were based on the
same individuals as in a pretest, intervening experience, posttest
situation.
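
The equivalence just derived can be checked numerically. The sketch below is only an illustration: the function names and the four frequencies are hypothetical and do not come from any table in the text. It computes χ² by the product formula for the fourfold table and the critical ratio with the pooled p and q, then compares χ² with the square of the critical ratio.

```python
import math


def chi_square_fourfold(a, b, c, d):
    """Chi square for a fourfold table by the product formula (85)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def critical_ratio(a, b, c, d):
    """Critical ratio (p1 - p2)/sigma_D using the pooled p and q, as in (27a)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    p = (a + c) / (n1 + n2)      # proportion on the bottom margin
    q = 1.0 - p
    sigma_d = math.sqrt(p * q / n1 + p * q / n2)
    return (p1 - p2) / sigma_d


# Hypothetical frequencies, chosen only for illustration.
A, B, C, D = 30, 20, 18, 32
chi2 = chi_square_fourfold(A, B, C, D)
cr = critical_ratio(A, B, C, D)
print(round(chi2, 4), round(cr ** 2, 4))
```

The two printed values agree, which is the point of the derivation. The agreement depends on using the pooled proportions in the standard error.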

Significance of changes. The student should carefully note
that although the application of χ² to fourfold tables of frequencies
like that of Table 11 in Chapter 5, which is here reproduced with
minor changes as Table 34, provides a means of testing the
significance of the association or correlation between two sets of
responses, such an application does not test the significance of change
from the first to the second set of responses. This latter test can
be made by means of formulas (28) or (28a) of Chapter 5,
preferably by (28a), since it is consistent with the null hypothesis
(that no change would be found if the universe of persons were
included in the experiment). It is also possible to test the
significance of any found change by the use of χ². To do this, we first
note that a net change for the group must necessarily involve the
difference between the frequencies, A and D, since the B and C
cases represent those who showed no change. The null hypothesis
would be that the universe frequencies are not different, i.e., that
A − D = 0; then for a given sample, A and D would differ only
as a result of chance sampling. Since A + D represents the total
number of individuals who changed (the A's from + to −, and
the D's from − to +), in setting up the null hypothesis concerning
the net change it would seem appropriate to say that, if A + D
individuals changed, (A + D)/2 would change in one direction
and (A + D)/2 in the other direction. Thus (A + D)/2 would
become the expected frequency; then A − (A + D)/2 and
D − (A + D)/2 would become the discrepancies between
observed and expected (on the basis of the null hypothesis)
frequencies. If A = D, both discrepancies would become zero.
Squaring each discrepancy and dividing by E and then summing
the two quotients or doubling either one will give a χ² which is
based on one degree of freedom (why one degree of freedom?).

Table 34. Fourfold Table of Frequencies and Proportions for a First
Set vs. a Second Set of Responses From the Same Individuals

Frequencies
                      2nd
                 −           +
  1st   +        A           B        A + B
        −        C           D        C + D
               A + C       B + D        N

Proportions
                      2nd
                 −           +
  1st   +        a           b         p1
        −        c           d         q1
                 q2          p2        1.0

A little algebraic manipulation shows that

\[
\chi^2 = \frac{(A - D)^2}{A + D} \tag{86}
\]

for the particular situation in which we wish to test the
significance of over-all change in response.

With a little additional algebra it can be shown that the χ² of
formula (86) equals the square of D/σ_D, where σ_D is computed
by formula (28a). The reason back of the statement given on
p. 80 that (28a) is inapplicable unless A + D ≥ 10 should now
be clear to the reader. If A + D were less than 10, the two E's
would be less than 5, which is an acceptable, though none too
conservative, lower limit for E. A correction needed when the E's
are smaller than 5 will be given later. One thing which may puzzle
the reader at this time is the fact that formula (86) does not
contain a total N. Its algebraic equivalent, (D/σ_D)², with σ_D
calculated by formula (28a), does contain N, so the absence of N from
(86) is more apparent than real.

The advantage of the χ² over the D/σ_D technique for testing the
significance of net changes in responses lies in the fact that χ²
values for two or more groups which have been used in an
experiment can be summed to a new χ² with n equal to the sum of the
separate df's; in this case n equals the number of chi squares being
summed.

Formula (86) is, of course, not restricted to situations involving
changes in responses. If we have the same individuals giving,
say, yes or no responses to two different questions and we desire
to test the significance of the difference between the frequencies
(or proportions) of yeses or noes, formula (86) is applicable. Or
suppose we wish to know whether there is a significant difference
in the difficulty of two test items which have been administered to
the same group. For example, in Table 31 we have 49 and 68
individuals passing items 1 and 2 respectively. Since N = 100,
the proportions are .49 and .68 (or 49 and 68 per cent). By
formula (86) we have χ² = (29 − 10)²/(29 + 10) = 9.26, which
for 1 degree of freedom falls between the .01 and .001 levels of
significance; hence it would be concluded that the two items are
different in difficulty. If we use formula (28a), we get a critical
ratio, (p2 − p1)/σ_D = (.68 − .49)/√[(.10 + .29)/100] = .19/.0624
= 3.04, which leads to the same probability figure as that for a
χ² of 9.26. Either method may be used. Both make due
allowance for the correlation which is present because the frequencies
or proportions being compared are based on the same individuals.
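
The two-item example can be reproduced in a few lines. This is a minimal sketch with hypothetical function names. It assumes that the 10 and the 29 are the two "changer" cells (which one is called A and which D does not affect χ²) and that the critical ratio has the form shown in the worked arithmetic above. For 1 degree of freedom the chance probability of χ² is taken as the two-tailed normal probability of its square root.

```python
import math


def chi_square_change(a, d):
    """Formula (86): chi square for net change from the two changer cells."""
    return (a - d) ** 2 / (a + d)


def critical_ratio_change(a, d, n):
    """Critical ratio in the form used in the worked example:
    difference of the two proportions over sqrt((a + d)/N),
    where a and d are the changer cells expressed as proportions."""
    pa, pd = a / n, d / n
    return (pd - pa) / math.sqrt((pa + pd) / n)


# Frequencies from the two-item example: 10 and 29 changers, N = 100.
chi2 = chi_square_change(10, 29)
cr = critical_ratio_change(10, 29, 100)

# With 1 degree of freedom, P for chi square equals the two-tailed
# normal probability of sqrt(chi square).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print(round(chi2, 2), round(cr, 2), p_value)
```

Squaring the critical ratio returns the same figure as formula (86), so either route leads to the same probability.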

Correction for continuity. We have already pointed out
that, since the sampling distribution of χ² is continuous, the use
of χ² when any one E is less than 5 is questionable. For fourfold
contingency tables, an allowance for discontinuity can be made by
applying Yates's correction for continuity, which should always
be used when any one E in such a table is less than 5 and is
advisable when an E is less than 10. A small E is most likely to occur
either when the total N is small or when one or both of the marginal
totals involve extreme dichotomies. It is easy to determine the
smallest E by dividing the product of the two smaller marginal
frequencies by the total N. Yates's correction can be incorporated
in formula (85), which becomes

\[
\chi^2 = \frac{N\left(|AD - BC| - N/2\right)^2}{(A+B)(C+D)(A+C)(B+D)} \tag{85a}
\]

and indicates that the absolute difference between AD and BC is
to be reduced by N/2. Formula (86) can also be written to include
a correction for continuity. The corrected form

\[
\chi^2 = \frac{\left(|A - D| - 1\right)^2}{A + D} \tag{86a}
\]

involves decreasing the absolute value of the difference between
A and D by 1. Formula (86a) is to be preferred to (86) when
A + D is less than 20 and should always be used when A + D is
less than 10.
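
A short sketch of the two corrected formulas follows, together with the rule for finding the smallest E. The function names and the small table of frequencies are hypothetical, chosen only so that the smallest expected frequency falls in the range where the correction is advisable.

```python
def smallest_expected(a, b, c, d):
    """Smallest expected cell frequency: the smaller row total times
    the smaller column total, divided by N."""
    n = a + b + c + d
    return min(a + b, c + d) * min(a + c, b + d) / n


def chi_square_fourfold_yates(a, b, c, d):
    """Formula (85a): fourfold chi square with Yates's correction."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    return numerator / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_change_corrected(a, d):
    """Formula (86a): chi square for net change, corrected for continuity."""
    return (abs(a - d) - 1) ** 2 / (a + d)


# Hypothetical small table, for illustration only.
A, B, C, D = 8, 4, 3, 9
n_total = A + B + C + D
uncorrected = n_total * (A * D - B * C) ** 2 / (
    (A + B) * (C + D) * (A + C) * (B + D))

print(smallest_expected(A, B, C, D))
print(uncorrected, chi_square_fourfold_yates(A, B, C, D))
print(chi_square_change_corrected(3, 9))
```

In this illustration the corrected χ² is smaller than the uncorrected one, so the correction works in the conservative direction.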

Chi square for 2 by k tables. The calculation of χ² from a
table with 2 rows and k columns (or 2 columns and k rows) can be
accomplished by way of expected cell frequencies calculated as
previously suggested from the marginal totals or by means of

\[
\chi^2 = \frac{N^2}{A_t B_t}\left[\sum \frac{B_i^2}{A_i + B_i} - \frac{B_t^2}{A_t + B_t}\right] \tag{87}
\]

in which the A's and B's have the meanings indicated in Table 35,
wherein will be found the frequencies for two groups classified
according to 5 response categories. The necessary computations
required by formula (87) are also included in the table. Note
that, as usual, the marginal totals are first found by summing
across and down. Column D is obtained by dividing the entries
in column B by the adjacent values in column C, and column E
results from multiplying the D column values by the B column
figures. These same operations, when applied to the last (or
totals) line, lead to the column E entry of 49.44, which is the value
of the B_t²/(A_t + B_t) term in formula (87). Summing the first 5
figures in column E yields 50.83, or the Σ term of (87), and the
difference between 50.83 and 49.44 is 1.39, the value of the
bracketed part of the formula. When this is multiplied by
N²/(A_t B_t), we have χ², which for a df of 4 yields a P of about .16.
In other words, once in 6 trials differences as large as those in
Table 35 would occur by chance; hence we have insufficient
evidence for concluding that the universes from which these 2
samples were drawn differ in regard to their responses to the asked
question.

Table 35. The Calculation of χ² from a 2 by k Table: 2 Groups and
k (= 5) Responses

             Col. A         Col. B         Col. C       Col. D             Col. E
  Response   Group I        Group II       A_i + B_i    B_i/(A_i + B_i)    B_i²/(A_i + B_i)
     1        27 (= A_1)     15 (= B_1)       42           .3571               5.36
     2        26 (= A_2)     16 (= B_2)       42           .3810               6.10
     3       247 (= A_3)    110 (= B_3)      357           .3081              33.89
     4        41 (= A_4)      8 (= B_4)       49           .1633               1.31
     5        39 (= A_5)     15 (= B_5)       54           .2778               4.17
                                                           Sum                50.83
  Totals     380 (= A_t)    164 (= B_t)      544 (= N)     .3015              49.44
                                                           Difference          1.39

  N²/(A_t B_t) = 544²/[(380)(164)] = 4.75;  χ² = (4.75)(1.39) = 6.60;  n = 4, P = .16
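
The computation of Table 35 can be sketched as follows. The function names are hypothetical. The first function follows formula (87); the second reaches χ² by way of expected cell frequencies, which should agree with it. The probability helper uses a closed form that holds only for an even number of degrees of freedom; it stands in for a table lookup and is not a method from the text. Carried out without rounding the intermediate entries, the arithmetic gives a χ² of about 6.54, slightly below the 6.60 obtained from the rounded table values.

```python
import math


def chi_square_2_by_k(a, b):
    """Formula (87) for two groups, a and b, over k response categories."""
    a_t, b_t = sum(a), sum(b)
    n = a_t + b_t
    sigma_term = sum(bi ** 2 / (ai + bi) for ai, bi in zip(a, b))
    return n ** 2 / (a_t * b_t) * (sigma_term - b_t ** 2 / n)


def chi_square_from_expected(a, b):
    """The same chi square by way of expected cell frequencies."""
    a_t, b_t = sum(a), sum(b)
    n = a_t + b_t
    total = 0.0
    for ai, bi in zip(a, b):
        column_total = ai + bi
        for observed, row_total in ((ai, a_t), (bi, b_t)):
            expected = row_total * column_total / n
            total += (observed - expected) ** 2 / expected
    return total


def chance_probability_even_df(chi2, df):
    """P of a chi square at least this large (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = chi2 / 2.0
    term, series = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


group_1 = [27, 26, 247, 41, 39]   # column A of Table 35
group_2 = [15, 16, 110, 8, 15]    # column B of Table 35

chi2_87 = chi_square_2_by_k(group_1, group_2)
chi2_e = chi_square_from_expected(group_1, group_2)
p_value = chance_probability_even_df(chi2_87, len(group_1) - 1)
print(round(chi2_87, 2), round(chi2_e, 2), round(p_value, 2))
```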

There are a couple of features in Table 35 which should be
noted. The sums are needed for all columns except D, which
could be omitted by simply squaring the B_i values in column B
before dividing by A_i + B_i. Although column D would seem to
be only a means to an end, it can be made to serve a useful purpose
if one wishes information as to which category yields the largest
difference between the two groups. The greatest percentage
difference will be for that category for which the ratio in column D
deviates farthest from the totals ratio (last figure in the column).
Thus in the example of Table 35 the ratio deviating farthest from
.3015 is .1633; hence the fourth category shows the greatest difference
between the two groups. Direct calculation indicates that 10.8
per cent of group I and 4.9 per cent of group II fell in the fourth
category. The difference, 10.8 − 4.9 = 5.9 per cent, is larger
than that for any other category.

If one had to depend upon the D/σ_D technique for testing the
significance of the group differences in Table 35, 5 critical ratios
would result: for each category there is a possible difference in
proportions or percentages with a standard error for each
difference. The 5 CR's might, and usually would, lead to 5 different P
values with a consequent predicament as to interpretation.
Offhand, it might be argued that, if any CR or P so determined
reached an acceptable level of significance, one would be justified
in concluding that the difference between the groups was real
rather than chance. That such an argument may be fallacious is
well illustrated by the data of Table 35, which are actual data.
When these data first came to the author's attention, the table
was in percentage form with a CR worked out only for the category
showing the largest difference. This CR, based on formula (27),
was 2.54, which is near the P = .01 level of significance, and it
had accordingly been concluded that a real difference had been
found. Now, when we consider the P of .16 for the over-all
comparison, we are not justified in placing much confidence in
such a conclusion.

Why the apparent inconsistency between two tests of
significance? Since most investigators are looking for group differences
rather than group similarities, there is the tendency to single out
a category for comparison not because of intrinsic a priori interest
in that category but because it happens to yield the largest
difference. By this a posteriori selection one tends to capitalize on
differences which may be large mainly as a result of chance. A
similar situation occurs when we have the means for several
groups: the largest of the possible differences may be the largest
partly or entirely as a result of chance. As will be seen in the
discussion of the analysis of variance in Chapter 13, before any one
difference is tested, an over-all test of significance should be applied.
If this over-all test yields a significant P, then and only then is one
justified in proceeding to an examination of single categories.
Thus the use of χ² for such situations as are exemplified in Table 35
not only provides an over-all single index of significance but also
helps us avoid false conclusions.

Application to k by l tables. Consider the data of Table 36,
which contains a contingency-type table involving 3 groups and
3 possible opinion responses. To test the significance of the
differences between the groups by use of the CR technique would
involve comparing the percentages for group I vs. II, I vs. III,
and II vs. III, for each of the 3 responses, a total of 9 CR's.
Even though there is no short-cut formula for computing χ² for
such a table, its calculation is far quicker than the determination
of 9 CR's. Straightforward computation gives χ² = 36.58, which
for df = 4 is double the value of the χ² needed for the P = .001
point. From Elderton's table we find that P is about .000001;
hence Table 36 as a whole exhibits highly significant differences
between the groups.

Table 36. Table of Frequency of 3 Possible Responses for 3 Groups
of Individuals. Percentages in the Parentheses Add Downward
to 100*

  Motivation of                         Group
  Conscientious
  Objectors             I             II            III         Total
  Not cowards        24 (27.0)     56 (53.8)     71 (69.6)       151
  Partly cowards     30 (33.7)     23 (22.1)     19 (18.6)        72
  Cowards            35 (39.3)     25 (24.0)     12 (11.8)        72
  N's                89 (100.0)   104 (99.9)    102 (100.0)      295

* Data from Leo Crespi, J. Psychol., 1945, 19, p. 285.

Perhaps a better understanding of the extent of the differences
can be had by considering the percentages given in parentheses in
the table. Membership in group III means a greater tendency to
the "not cowards" response. Group I tends more to give the
"cowards" response. Now it happens that the three groups, I, II,
and III, can be (and are) placed in an ordered series for amount
of education: grammar school, high school, and college respectively.
Thus the association shown in the table is in the direction of less
disparagement of conscientious objectors by those in the higher
educational level. The strength of association or degree of
correlation is represented by a contingency coefficient of .33, which may
seem rather low in light of the highly significant P. This
illustrates a point which most readers will already have grasped: high
statistical significance and a high degree of association are far
from synonymous. Consideration of the data of Table 36 readily
indicates the difficulty of predicting responses when the extent of
association is represented by a C of .33.

As in the 2 by k table, so here it is better to calculate an over-all
χ² before examining by the CR technique any of the possible
separate comparisons. Unless the χ² P is significant, it is unwise to
proceed with such comparisons.
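
The over-all χ² for a table such as Table 36 comes straight from the marginal totals, as the following sketch shows. The function names are hypothetical, and the contingency coefficient is taken in its usual form, C = √[χ²/(N + χ²)], which is an assumption here since the formula is not restated in this section. Unrounded arithmetic gives a χ² near 36.7, close to the reported 36.58 (small differences of this kind can arise from rounding), and a C of about .33.

```python
import math


def chi_square_table(rows):
    """Chi square, df, and N for a table of observed frequencies,
    with expected frequencies taken from the marginal totals."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df, n


def contingency_coefficient(chi2, n):
    """Contingency coefficient in its usual form (assumed here)."""
    return math.sqrt(chi2 / (n + chi2))


# Frequencies of Table 36: rows are responses, columns are groups I, II, III.
table_36 = [
    [24, 56, 71],   # not cowards
    [30, 23, 19],   # partly cowards
    [35, 25, 12],   # cowards
]

chi2, df, n = chi_square_table(table_36)
c = contingency_coefficient(chi2, n)
print(round(chi2, 2), df, n, round(c, 2))
```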

Goodness of fit. The use of χ² in testing the goodness of fit of
a theoretical curve to an observed frequency distribution is
illustrated in Table 37. One starts with an actual distribution, usually
with more grouping intervals than in our example, and the
descriptive statistical measures therefor. In fitting the normal curve
to the distribution of Table 37, we need N, M, and σ. To set up
for each interval the frequency which would hold for the
best-fitting normal curve, we go through the tedious process of
determining the proportionate area under the theoretical curve for
each interval. Once the proportions are known, each is multiplied
by N to secure the expected frequencies. The proportions are
ascertained by calculating the x/σ value of the boundary limits of
the intervals. For example, the 110-119 interval may be thought
of as running from 109.5 to 119.5, since IQ's are rounded to the
nearest integer. Then (109.5 − 104.56)/16.99 = .2907 as the
x/σ for the lower limit, and (119.5 − 104.56)/16.99 = .8793 as
the x/σ for the upper limit of the 110-119 interval. Of course,
.8793 is also the lower limit for the 120-129 interval. Now the
difference, .8793 − .2907 = .5886, is the same as 10/16.99 or i/σ,
which is the interval width expressed in x/σ units. Adding .5886
once to .2907 gives .879 (it is sufficient to retain three decimals);
adding it twice gives 1.468; and so on. Then subtracting .5886
once from .2907 gives −.298; subtracting twice gives −.886; etc.
When the boundary limits in terms of x/σ have been set up, the
proportionate area for a given interval is found by using the table
of normal curve areas. The two top intervals have been combined,
and likewise the three bottom intervals, so as to have no expected
frequencies less than 10. The proportionate areas, .0041 and .0040,
represent the areas beyond given points, and the E's at top and
bottom are the number of cases expected beyond these same
points. Note that the sum of the proportions should be unity
within limits of rounding errors, and that the sum of the expected
frequencies should be the same as the sum of the observed
frequencies. Perhaps it is unnecessary to point out that the expected
frequencies form an exactly (within limits of rounding errors and
for the given intervals) normal distribution which will yield the
same M and σ as the observed distribution with which we started.

Table 37. Goodness of Fit of Normal Curve to Stanford-Binet IQ's,
Form M

                                  Proportionate
   IQ        O           x/σ          Area          E      O − E    (O − E)²/E
  160        3 }
  150       13 }  16    2.645        .0041         12        4        1.33
  140       55          2.057        .0158         47        8        1.36
  130      120          1.468        .0512        152      −32        6.74
  120      330           .879        .1186        352      −22        1.38
  110      610           .291        .1958        582       28        1.35
  100      719          −.298        .2316        688       31        1.40
   90      592          −.886        .1950        579       13         .29
   80      338         −1.475        .1177        350      −12         .41
   70      130         −2.064        .0506        150      −20        2.67
   60       48         −2.652        .0155         46        2         .09
   50        7 }
   40        4 }  12                 .0040         12        0         .00
   30        1 }
          2970 = N                   .9999       2970        0       17.02 = χ²

  M = 104.56     σ = 16.99     df = 11 − 3 = 8     P = .03



Straightforward calculation gives a χ² of 17.02. With
df = 11 − 3 (number of intervals minus the number of constants
used in the fitting), P = .03, i.e., only 3 times in 100 would as
large a χ² arise by chance, or only 3 times in 100 would we get a
worse fit if the universe of IQ's were distributed as a normal curve.
This would lead one to question whether IQ's, as measured by
Form M of the 1937 Revision of the Stanford-Binet, are
distributed in the normal curve fashion. The same data with intervals
of size 5 give a χ² P of .003, and the degree of kurtosis (by moments)
is thrice its standard error; therefore one can conclude that the
observed distribution is not a chance departure from a normal
distribution.
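
The tedious part of the fitting, finding the proportionate areas, can be sketched with the error function in place of a table of normal curve areas. The function names are hypothetical. The boundaries are the lower limits of the classes of Table 37 after the extreme intervals have been combined. Because the areas are not rounded to four decimals here, the expected frequencies and the χ² differ slightly from the tabled figures; the last lines recompute χ² from the rounded E's of the table for comparison.

```python
import math


def normal_cdf(z):
    """Area under the unit normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_normal_frequencies(lower_limits, n, mean, sd):
    """Expected frequencies for classes whose lower boundaries are given
    in descending order; the top class is open above and a final class
    lies below the lowest boundary."""
    z = [(x - mean) / sd for x in lower_limits]
    areas = [1.0 - normal_cdf(z[0])]                 # beyond the top boundary
    areas += [normal_cdf(hi) - normal_cdf(lo) for hi, lo in zip(z, z[1:])]
    areas.append(normal_cdf(z[-1]))                  # below the lowest boundary
    return [n * a for a in areas]


def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Table 37: observed frequencies after combining the extreme intervals.
observed = [16, 55, 120, 330, 610, 719, 592, 338, 130, 48, 12]
lower_limits = [149.5, 139.5, 129.5, 119.5, 109.5,
                99.5, 89.5, 79.5, 69.5, 59.5]
N, M, SD = 2970, 104.56, 16.99

expected = expected_normal_frequencies(lower_limits, N, M, SD)
chi2_fit = chi_square(observed, expected)
df_fit = len(observed) - 3           # N, M, and sigma are made to agree

# The rounded expected frequencies printed in Table 37.
table_expected = [12, 47, 152, 352, 582, 688, 579, 350, 150, 46, 12]
chi2_table = chi_square(observed, table_expected)

print([round(e, 1) for e in expected])
print(round(chi2_fit, 2), round(chi2_table, 2), df_fit)
```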

Thus the χ² technique provides us with a test by means of which
we can judge that the frequencies of a given distribution do not
follow the frequencies of a theoretical curve closely enough to be
regarded as chance departures therefrom. Note that a smaller
value for χ² for the example of Table 37 would not prove that the
universe is normal even though the P were as large as .90 or .95.
This would merely indicate that the given data were consistent
with the normal distribution. As a matter of fact, so-called
excellent fits leading to P's of .99 or more are suspect. When
P = .01, it is said that chance sampling would lead to a worse
fit only once in a hundred times; when P = .99, it is said that
chance sampling would lead to a better fit only once in a hundred
times. In other words, if P is between .05 and .01, the hypothesis
that the universe distribution is of the normal type (or whatever
type was fitted) is questionable; if P is .01 or less, this hypothesis
is rejected; if P is between .95 and .99, one may suspect the fit as
being too good; if P is .99 or more, one should definitely look for
an error in calculation or for some type of restraint on the operation
of chance. Too good a fit is as open to question as too poor a fit.
If P is between .05 and .95, the fit is said to be satisfactory.

When one is testing the goodness of fit of frequency curves, the
df depends upon the number of grouping intervals and upon the
number of restrictions imposed or the ways in which the expected
distribution is made to agree with the observed distribution. The
general principle back of the determination of df for χ² as a test of
fit may be illustrated for the case of testing the goodness of fit of
the normal curve. The expected and observed distributions are
made to agree with respect to N, M, and σ. Suppose that we have
k grouping intervals and that we let f_i stand for the frequency in
the ith interval and X_i for its score value (midpoint), and that x_i
represents the corresponding deviation score value for this
midpoint. Then the following equations will hold:

\[
\begin{aligned}
f_1 + f_2 + f_3 + \cdots + f_i + \cdots + f_k &= N \\
f_1 X_1 + f_2 X_2 + f_3 X_3 + \cdots + f_i X_i + \cdots + f_k X_k &= NM \\
f_1 x_1^2 + f_2 x_2^2 + f_3 x_3^2 + \cdots + f_i x_i^2 + \cdots + f_k x_k^2 &= N\sigma^2
\end{aligned}
\]

Now, if all the f values were known except f_1, f_2, and f_3, those
parts to the right of the term containing f_3 in the first equation
could be added numerically. The resulting sum could be shifted
to the right of the equality sign and then combined numerically
with N, giving an equation of the type f_1 + f_2 + f_3 = a, where
a equals N minus the sum of all the frequencies save the first three.
Likewise, the parts beyond the f_3 term in each of the other two
equations could be summed numerically, shifted to the right, and
combined numerically with the constant, NM for the second and
Nσ² for the third equation.

This procedure will lead to three simultaneous equations with
f_1, f_2, and f_3 as the unknowns:

\[
\begin{aligned}
f_1 + f_2 + f_3 &= a \\
f_1 X_1 + f_2 X_2 + f_3 X_3 &= b \quad \text{(say)} \\
f_1 x_1^2 + f_2 x_2^2 + f_3 x_3^2 &= c \quad \text{(say)}
\end{aligned}
\]

It is a well-known principle of algebra that 3 equations in 3
unknowns will be satisfied (if solvable) by just 1 set of values for the
unknowns. For our particular problem, this means that, as soon
as the frequencies for all but 3 (any 3) intervals are known, these
3 remaining frequencies are not "free to vary"; they are fixed
because of the requirements that the frequencies or functions
thereof must add to N, NM, and Nσ². We accordingly lose 3
degrees of freedom, and therefore when we are testing the fit of a
normal curve to a distribution with k intervals, the df is k − 3.
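
The argument can be made concrete with a small numerical sketch. The distribution below is hypothetical (6 intervals with invented midpoints and frequencies), and the helper name is an assumption. The three constants N, NM, and Nσ² are computed from the full distribution; the first three frequencies are then treated as unknown and recovered from the three simultaneous equations, which shows that they are not free to vary. Exact fractions are used so that the recovered values are not blurred by rounding.

```python
from fractions import Fraction


def solve_3x3(m, rhs):
    """Solve a 3-by-3 linear system by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(m)
    solution = []
    for j in range(3):
        # Replace column j by the right-hand side.
        mj = [[rhs[i] if col == j else m[i][col] for col in range(3)]
              for i in range(3)]
        solution.append(det(mj) / d)
    return solution


# Hypothetical distribution with k = 6 intervals.
midpoints = [Fraction(x) for x in (1, 2, 3, 4, 5, 6)]
freqs = [Fraction(f) for f in (2, 6, 11, 9, 5, 3)]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
devs = [x - mean for x in midpoints]
n_var = sum(f * d * d for f, d in zip(freqs, devs))   # N times sigma squared

# Treat the first three frequencies as unknown and shift the known
# terms to the right of the equality signs.
a = n - sum(freqs[3:])
b = n * mean - sum(f * x for f, x in zip(freqs[3:], midpoints[3:]))
c = n_var - sum(f * d * d for f, d in zip(freqs[3:], devs[3:]))

coefficients = [
    [Fraction(1), Fraction(1), Fraction(1)],
    midpoints[:3],
    [d * d for d in devs[:3]],
]
recovered = solve_3x3(coefficients, [a, b, c])
print([int(f) for f in recovered])
```

Once the other k − 3 frequencies are given, the three equations return the first three frequencies exactly, which is why 3 degrees of freedom are lost.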

If we wished to ascertain whether the observed distribution of
Table 37 could be thought of as a chance departure from a normal
curve with mean equal to 100, the expected frequencies would be
so set up as to yield the observed σ and N, but with M = 100.
The df would therefore be 11 − 2, since the distributions are
forced to agree only as to 2 constants, N and σ; hence 2 degrees of
freedom are lost.

Chi square can be used to test the significance of the difference
between 2 observed frequency distributions, but this simply
becomes a 2 by k table with expected values computed from the
marginal totals as previously indicated. In such a situation, it is
incorrect to treat either set of frequencies as those expected, against
which the other is compared as a set of observed values. Such a
procedure does not allow for the fact that both sets of frequencies
are subject to sampling fluctuations. If one set of frequencies is
for the universe, and the second set is based on a sample from the
universe, then the universe frequencies (or proportions) can be
used to set up expected frequencies, against which the sample
values may be checked in order to test whether the sample
represents the universe within the limits of chance sampling error. The
df becomes k − 1, since this requires only that N_e = N_o.

In this chapter we have discussed the essential nature of χ² and
have pointed out typical applications. By now the student should
appreciate the advantages of χ² over percentage comparisons and
have some insight into the use of χ² as a means of testing
hypotheses.



CHAPTER 12 


Small Sample Methods 


The sampling error formulas given in Chapter 5 and
subsequently are applicable and interpretable by means of the normal
curve only when N is greater than 30. In the case of χ², the
requirement as to sample size is that N be sufficiently large to
avoid the use of expected frequencies of less than 5. In this
chapter the allowances necessary in judging statistical significance
when small samples (N's of less than 30) are involved will be
given. We shall confine our attention to the mean (including the
mean of differences), the difference between means, the product
moment correlation coefficient, and the difference between
standard deviations or variances. For the first three of these, the usual
critical ratio (such as D/σ_D or r/σ_r) will be replaced by a new
ratio, commonly symbolized as t, which differs from CR in two
important respects: a refined estimate of the standard error is
utilized, and the sampling distribution of t does not follow the
normal curve. We will again have the concept of degrees of
freedom but in a somewhat different way from that in connection
with χ². The reader should be warned in advance that the small
sample techniques definitely assume that the universe of values for
the trait being studied, and as measured, forms a normal
distribution.

In testing the significance of the difference between the standard
deviations or variances for two small samples, a ratio other than
t is needed. This ratio will be introduced in this chapter, but its
major and extensive applications will be reserved for later
chapters which have to do with the "analysis of variance." The
analysis of variance technique may be thought of as an extension
of the t technique in that the significance of the difference between
two or more means can be summarized in one probability figure.


THE SAMPLING DISTRIBUTION OF t 

It will be recalled that the sampling distribution of the mean is
normal, when the trait distribution is normal, regardless of the
size of the samples, and that it is nearly normal for large samples
drawn from moderately skewed parent distributions. The
sampling distribution of means will center about the population or
universe mean, and the standard deviation of the distribution will
be σ̃/√N, where σ̃ is the standard deviation of the trait for the
universe. With M distributed normally about M̃, the universe
value, it follows that the deviations, M − M̃, will be distributed
normally about zero. Now, if each such deviation is divided by
σ̃_M = σ̃/√N, the resulting ratios will also be distributed
normally about zero, with unit variance. That is, the deviations
about M̃ divided by σ̃_M will follow the normal curve with SD = 1.
When successive samples are drawn and a σ_M is computed for
each sample mean by using the sample standard deviation instead
of the unknown σ̃, the ratios of given (M − M̃)'s to their σ_M
values so computed will be distributed normally for very large
N's and approximately so for N's of moderate size, but for N's as
small as 30 the approximation is none too good. The value 30 is
arbitrarily chosen; the approximation becomes progressively
worse as we go from large to small N's rather than becoming
suddenly worse in the vicinity of N = 30.

There are two reasons why the sampling distribution of
(M − M̃)/σ_M for small N's does not follow the normal curve when
each σ_M is based on a sample standard deviation. Actually, when
we use σ/√N in the place of σ̃/√N, we are thinking of σ as an
estimate of σ̃. This estimate suffers from a bias which is negligible
when N is large but becomes sizable when N is small. The second
reason for the failure of the sampling distribution of (M − M̃)/σ_M
to follow the normal curve is the fact that the successive sample
values of σ are skewed in distribution when N is small. Thus the
values of σ_M computed from successive samples will also yield a
skew distribution. The successive sample values of (M − M̃)/σ_M
will accordingly involve a variable numerator which is normally
distributed and a variable denominator which has a skew
distribution. The distribution of the resulting ratios will be symmetrical
about zero but steeper than the normal curve, and the smaller
the N the more leptokurtic the shape of the curve. The chance
distribution of t yields more large values than does the normal
distribution.
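
The excess of large ratios for small samples can be seen in a rough simulation. Everything in this sketch is an assumption made for illustration: the function names, the seed, the number of trials, and a normal universe with mean 0 and standard deviation 1. Each ratio divides the deviation of a sample mean from the universe mean by a standard error computed from that sample's own standard deviation (sum of squares divided by N).

```python
import math
import random


def ratio_for_sample(n, rng, universe_mean=0.0, universe_sd=1.0):
    """Deviation of the sample mean over a standard error computed
    from the sample's own standard deviation."""
    scores = [rng.gauss(universe_mean, universe_sd) for _ in range(n)]
    m = sum(scores) / n
    sample_sd = math.sqrt(sum((x - m) ** 2 for x in scores) / n)
    sigma_m = sample_sd / math.sqrt(n)
    return (m - universe_mean) / sigma_m


def proportion_beyond(n, cutoff=1.96, trials=5000, seed=12345):
    """Proportion of ratios exceeding the cutoff in absolute value."""
    rng = random.Random(seed)
    hits = sum(abs(ratio_for_sample(n, rng)) > cutoff for _ in range(trials))
    return hits / trials


small = proportion_beyond(5)     # samples of 5 cases
large = proportion_beyond(100)   # samples of 100 cases
print(small, large)
```

If the ratios followed the normal curve, about 5 per cent would fall beyond 1.96 for any N. In runs of this kind the proportion for samples of 5 is well above that figure, while for samples of 100 it is close to it.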

There are a number of statistical measures which, when
expressed in some form of deviation and divided by an estimated
sampling error, follow the t distribution. The mathematical
equation * for this distribution includes an n which is defined as the
number of degrees of freedom involved in the estimate of sampling
error. Figure 17 shows the curve for t when n = 7 and when
n = 3 as compared to the normal curve. For n larger and larger,
the curve of t approaches that of the normal distribution. The
table for the areas under the curve of t gives the values of t, for
n's of 1 to 30, which will be exceeded by chance a specified
proportion of times. Thus for n = 30 we see from Table E of the
Appendix that the P = .05 point is at a t of 2.04 as compared with a
normal deviate of 1.96. For n = 10, the point corresponding to
the .05 level is t = 2.23. The .01 level is at t = 2.75 for n = 30, at
3.17 for n = 10, as compared with 2.58 for the normal curve.

Fig. 17. Normal compared with t distribution for n = 3 and n = 7.
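
The tabled points quoted above can be checked from the equation of the t curve. The sketch below uses the standard density of the t distribution with n degrees of freedom and Simpson's rule for the area; the function names and the number of integration steps are assumptions, and the procedure is a stand-in for a printed table, not a method from the text.

```python
import math


def t_density(t, n):
    """Ordinate of the t curve for n degrees of freedom."""
    coefficient = math.gamma((n + 1) / 2) / (
        math.sqrt(n * math.pi) * math.gamma(n / 2))
    return coefficient * (1 + t * t / n) ** (-(n + 1) / 2)


def two_tailed_probability(t, n, steps=2000):
    """Proportion of the t curve lying beyond -t and +t,
    with the central area found by Simpson's rule."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, n)
    central_half = total * h / 3        # area from 0 to t
    return 1.0 - 2.0 * central_half


for t_value, n in ((2.04, 30), (2.23, 10), (2.75, 30), (3.17, 10)):
    print(t_value, n, round(two_tailed_probability(t_value, n), 3))
```

The first two probabilities come out near .05 and the last two near .01, in agreement with the values read from the table.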

Degrees of freedom. The n of the t distribution and in the
table of t is the df for the estimate of sampling error. This df is a
function of N, as contrasted with the df for χ², which is a function
of the number of categories. In both cases the df depends upon
the number of independent values, which in turn depends in part
upon the number of restrictions involved. Suppose that we
consider the df for the sampling error of the mean. Actually, since
the standard error of the mean depends upon N and σ, the df has
to do with the estimate of the needed standard deviation. By
definition the variance (σ²) of a set of scores or measures is given
by Σx²/N or Σ(X − M)²/N, the numerator of which is the
sum of squares (of deviations from the mean). Suppose two
scores, 3 and 5. Their mean is 4, and the sum of squares is
(3 − 4)² + (5 − 4)² = 2. It will be recalled that Σx =
Σ(X − M) = 0; therefore as soon as one deviation is known the
other x is determinable. Thus, if x1 is −1, the other deviation x2
must satisfy the equation −1 + x2 = 0. One deviation and its
square can be thought of as dependent upon the other deviation,
which has some independence, hence 1 degree of freedom.
Suppose that we have three scores, 3, 4, and X, which yield a mean of
4. The deviations must satisfy the requisite that they sum to
zero; i.e., (3 − 4) + (4 − 4) + (X − 4) = 0. Thus one of the
three deviations is fixed by the other two, i.e., is not independent
of their values, because the three deviation scores must sum to
zero.

It may be more enlightening to start with symbols for scores.
Suppose that $X_1$, $X_2$, $X_3$, and $X_4$ represent four scores, and it is
reported that their mean equals 40. How many of the four deviation
scores can we assign at will? Stated in deviation units, we
have $(X_1 - 40) + (X_2 - 40) + (X_3 - 40) + (X_4 - 40)$ as a
sum which must equal zero. It is readily apparent that only three
deviations can "vary freely"; the fourth is fixed by the numerical
values of the other three. Hence df = 4 - 1; i.e., 1 degree of
freedom in the deviations or their squares is lost because of the one
restriction imposed. The df for a sum of squares about a mean is
always N - 1 when N scores have been used to compute the
mean. In general, the df for the sum of squares is equal to the
number of squares minus the number of restrictions imposed by
constants computed from the data.
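
The restriction can be made concrete with a small Python sketch. It is only an illustration: the function name `last_deviation` and the three freely chosen deviations are our own assumptions, not part of the text. It shows that once N - 1 deviations are assigned, the remaining one is fixed by the requirement that the deviations sum to zero.

```python
def last_deviation(free_deviations):
    """Return the single deviation fixed by the restriction sum(x) = 0."""
    return -sum(free_deviations)


# Four scores with a reported mean of 40: any three deviations may be
# chosen at will (these three are hypothetical values).
free = [-3, 5, 1]
fixed = last_deviation(free)          # the fourth deviation is not free
deviations = free + [fixed]
scores = [40 + d for d in deviations]

# One restriction (the mean computed from the data) costs one df.
df = len(scores) - 1

print("deviations:", deviations)
print("scores:", scores)
print("df:", df)
```

Changing any of the three free values changes the fourth, but the count of freely assignable deviations stays at N - 1.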

Estimation of variance. We have already mentioned that
$\sigma^2 = \Sigma x^2/N$ as an estimate of the universe variance suffers from a
small bias. The mean of the $\sigma^2$ values from a large number of
samples tends to be slightly smaller than the variance for the universe
or population from which the samples are drawn. If, instead
of determining $\sigma^2$ from $\Sigma x^2/N$, we use $\Sigma x^2/(N - 1)$, this
bias disappears. It is conventional to take the square root of the
latter expression as an unbiased estimate of the population standard
deviation. It is important to note that (1) the difference
between the ordinarily computed $\sigma$ and the best estimate of $\tilde{\sigma}$ is
small for N large, and (2) instead of dividing the sum of squares
by N, the best estimate involves dividing by the df. The use of
df in the place of N is quite universal in connection with small
sample and analysis of variance techniques. Hence the user of
these techniques must be able to figure out the df for a given
situation.
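
The bias can be checked exactly, without random sampling, by listing every possible sample from a very small universe. The sketch below is a hedged illustration under our own assumptions: the universe {1, 2, 3}, the sample size N = 2, and sampling with replacement are all hypothetical choices made to keep the enumeration short.

```python
from fractions import Fraction
from itertools import product


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean (exact arithmetic)."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)


population = [1, 2, 3]   # hypothetical universe of values
N = 2                    # hypothetical sample size

pop_mean = Fraction(sum(population), len(population))
pop_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

# Every equally likely ordered sample of size N, drawn with replacement.
samples = list(product(population, repeat=N))

# Average the two competing estimates over all possible samples.
mean_dividing_by_N = sum(sum_of_squares(s) / N for s in samples) / len(samples)
mean_dividing_by_df = sum(sum_of_squares(s) / (N - 1) for s in samples) / len(samples)

print("universe variance:      ", pop_variance)
print("mean of SS/N estimates: ", mean_dividing_by_N)
print("mean of SS/df estimates:", mean_dividing_by_df)
```

Under these assumptions the estimates that divide by N average to less than the universe variance, while those that divide by the df average to it exactly.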

Note on notation. The use of symbols in statistical methods 
has not been entirely standardized. Consequently the student, 
when reading research literature or consulting various works on 
statistical techniques, must be in a position to translate from one 
set of symbols to another. We have purposely kept the notation 
as simple as possible, and up to this point the symbols have agreed 
more or less with those of writers who follow the Pearson tradition. 
Contemporary notation tends to be patterned after that of 
R. A. Fisher, though not entirely so. For example, Professor
Fisher uses x in the sense that most writers use X to indicate a 
score or original measurement. Fisher and others have also fol¬ 
lowed a system of notation in which Greek letters represent popu¬ 
lation parameters, and the corresponding sample statistics are 
designated by roman equivalents. Such a system has its merits 
in mathematical statistics, and in practical work it is sometimes
needed for clarity. Indeed, sometimes three symbols are needed.
Take, for example, the notation for standard deviation or variance.
We need a symbol for the population value, one for the sample
value, and one for the best available estimate of the population
value. We have used $\tilde{\sigma}$ and $\sigma$ respectively for the first two;
following Fisher, we shall use $s$ as the symbol for the best estimate of
$\tilde{\sigma}$. For N large, the difference between $\sigma$ and $s$ is negligible, since
ordinarily the former involves dividing the sum of squares by N,
the latter by N - 1 or the df. We shall see that the fundamental
comparison in the analysis of variance is between two estimates 
of a population variance where, as a rule, the difference between 
using N and df is of some consequence. 

In connection with the t technique and the analysis of variance,
it is convenient to use the symbol $\bar{X}$ instead of M for the mean.
Thus the sum of squares would be written as $\Sigma(X - \bar{X})^2$. Appropriate
subscripts may be used to designate different groups. These
will be introduced as needed.

The t technique and a single sample mean. The basic principles
in the interpretation of $s_M$ or $s_{\bar{X}}$ (the estimated standard
error of the mean) are the same as those in the discussion on
pp. 51-59. Instead of $\sigma_M = \sigma/\sqrt{N}$, we have $s_{\bar{X}} = s/\sqrt{N}$, where
$s$ is defined as $\sqrt{\Sigma(X - \bar{X})^2/(N - 1)}$. One can write the complete
expression for the estimated standard error of the mean as

$$s_{\bar{X}} = \frac{s}{\sqrt{N}} = \frac{\sqrt{\dfrac{\Sigma(X - \bar{X})^2}{N - 1}}}{\sqrt{N}} \qquad (88)$$

From this it is readily seen that $s_{\bar{X}} = \sigma/\sqrt{N - 1}$.

We can test the significance of a given $\bar{X}$ as a deviation from any
preassigned value, say K, by taking $t = (\bar{X} - K)/s_{\bar{X}}$; then, by
entering the table of t with n = df = N - 1, the probability of as
large a deviation can be ascertained. Unless the P is less than
.05 or .01 or whatever level of significance one chooses, the deviation
would be attributed to chance sampling. If one wishes to
specify the confidence limits for the unknown population mean
and to do so with a level of confidence indicated by P = .99, he
first notes from the table of t how large t must be, for the given df,
to correspond to the .01 probability level. Then $\bar{X}$ plus and minus
the t so found, times $s_{\bar{X}}$, will give the desired limits. For example,
suppose 9 cases yield a mean of 80 and a sum of squares of 1152.
Dividing the sum of squares by df, or 8, we get $s^2 = 144$, $s = 12$
as an estimate of $\tilde{\sigma}$, and $s_{\bar{X}} = 12/\sqrt{9} = 4$. For 8 df we find from
Table E that t = 3.355 for the .01 level. Then $80 \pm (3.355)(4)$
gives 66.58 and 93.42 as the .99 confidence limits for the population
mean. If we used the methods of Chapter 5, we would have
$\sigma^2 = 1152/9$, giving $\sigma$ as 11.31, from which we get $\sigma_M = 11.31/\sqrt{9}
= 3.77$. Referring to the normal table, we find that a relative
deviate of 2.575 corresponds to the .01 level; hence
$80 \pm (2.575)(3.77)$ gives 70.29 and 89.71 as the .99 confidence
limits for the universe mean. Thus, when proper allowance is
made for the smallness of the sample, an appreciable difference in
the so-called confidence interval is found.
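
The arithmetic of this example can be retraced in a few lines of Python. This is a sketch only: the variable names are ours, and the two critical values (3.355 and 2.575) are simply copied from the text rather than computed, since no table lookup is attempted here.

```python
import math

N = 9
mean = 80.0
sum_sq = 1152.0

# Small sample method: divide the sum of squares by the df (N - 1)
# and use the tabled t for 8 df at the .01 level.
s = math.sqrt(sum_sq / (N - 1))
se_small = s / math.sqrt(N)
t_01 = 3.355                      # quoted in the text from Table E
t_limits = (mean - t_01 * se_small, mean + t_01 * se_small)

# Large sample method of Chapter 5: divide by N and use the normal deviate.
sigma = math.sqrt(sum_sq / N)
se_large = sigma / math.sqrt(N)
z_01 = 2.575                      # quoted in the text from the normal table
z_limits = (mean - z_01 * se_large, mean + z_01 * se_large)

print("t-based limits:     ", [round(v, 2) for v in t_limits])
print("normal-based limits:", [round(v, 2) for v in z_limits])
```

The wider t-based interval reflects both the larger standard error (division by the df) and the larger critical value.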



THE COMPUTATION OF $s^2$ OR $s$

It is appropriate at this point to indicate economical methods
for calculating $s^2$. If $\sigma^2$ has already been computed, it is easy to
get $s^2$, since from the relationship $\sigma^2 = \Sigma(X - \bar{X})^2/N$ we get
$N\sigma^2 = \Sigma(X - \bar{X})^2$. Dividing the right-hand side by N - 1
(or the df) gives $s^2$; hence dividing the left-hand side by N - 1
will also give $s^2$, i.e.,

$$s^2 = \frac{N\sigma^2}{N - 1} \qquad (89)$$

But in order to compute $\sigma^2$ we need the sum of squares; once we
have this sum we can proceed directly to $s^2$ by substituting in
$\Sigma(X - \bar{X})^2/(N - 1)$. Since the situation in which $s$ is preferred
to $\sigma$ always involves a small number of measures, it is not economical
to make a frequency distribution and do the computations
in terms of deviations from an arbitrary origin. What we
need is a method for computing the sum of squares directly from
gross measures or scores without the necessity of determining each
individual deviation of the type $X - \bar{X}$, which would involve
fractional values unless $\bar{X}$ happened to be an integer.

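If we expand the expression for the sum of squares (of deviations
from the mean), we have

$$\Sigma(X - \bar{X})^2 = \Sigma X^2 - 2\bar{X}\Sigma X + N\bar{X}^2$$

Since by definition $\bar{X} = \Sigma X/N$, this becomes

$$\Sigma X^2 - 2\frac{\Sigma X}{N}\Sigma X + N\left(\frac{\Sigma X}{N}\right)^2 = \Sigma X^2 - 2\frac{(\Sigma X)^2}{N} + \frac{(\Sigma X)^2}{N}$$

or

$$\Sigma(X - \bar{X})^2 = \Sigma X^2 - \frac{(\Sigma X)^2}{N} \qquad (90)$$

This obviously can be written as

$$\Sigma(X - \bar{X})^2 = \frac{1}{N}\left[N\Sigma X^2 - (\Sigma X)^2\right] \qquad (90a)$$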

which is better adapted to machine calculation. [The student
may note that dividing both sides of (90a) by N and then taking
the square root leads to formula (6a) for computing $\sigma$ from gross
scores.] Formula (90) or (90a) is the fundamental scheme for
determining the sum of squares. Note that two sums are called
for: the sum of the squared gross scores, and the sum of the gross
scores. This latter sum is, of course, needed for determining $\bar{X}$.
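
Formulas (90) and (90a) translate directly into code. The following Python sketch uses function names of our own choosing (`sum_of_squares_90`, `sum_of_squares_90a`, `variance_estimate`); it is meant only to show that the two sums called for, of the gross scores and of their squares, suffice.

```python
def sum_of_squares_90(scores):
    """Formula (90): sum of X^2 minus (sum of X)^2 / N, from gross scores."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return sum_x2 - sum_x ** 2 / n


def sum_of_squares_90a(scores):
    """Formula (90a): (1/N) [N * sum of X^2 - (sum of X)^2]."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return (n * sum_x2 - sum_x ** 2) / n


def variance_estimate(scores):
    """s^2: the sum of squares divided by the df, N - 1."""
    return sum_of_squares_90(scores) / (len(scores) - 1)


# The two scores used earlier in the text, 3 and 5.
print(sum_of_squares_90([3, 5]), sum_of_squares_90a([3, 5]))
print(variance_estimate([3, 5]))
```

No individual deviation from the mean is ever formed, which is the economy the text describes.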


DIFFERENCE BETWEEN UNCORRELATED MEANS 


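Suppose two groups of $N_1$ and $N_2$ cases, and that we wish to
test the significance of the difference, $D_M = \bar{X}_1 - \bar{X}_2$. By the
procedure of Chapter 5 for large N's, we would make the necessary
calculations for determining $D_M/\sigma_{D_M}$ or CR. As an aid to transition
in thought from CR to t, let us first write the expression for CR,
thus

$$CR = \frac{D_M}{\sigma_{D_M}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_{\bar{X}_1}^2 + \sigma_{\bar{X}_2}^2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{N_1} + \dfrac{\sigma_2^2}{N_2}}}$$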

which involves the two sample variances. Now, for the small
sample situation, we have $t = D_M/s_{D_M}$, where $s_{D_M}$ is to be the
best possible estimate of the standard error of the difference. To
get this we apparently need the best possible estimates for the
two variances of the two populations from which the samples
have been drawn. It will be recalled that, in testing the significance
of the difference between two means, the null hypothesis is
set up. This implies that the two samples have been drawn from
populations having the same parameters, i.e., that the two universe
means are equal and also that $\tilde{\sigma}_1 = \tilde{\sigma}_2$. It is sometimes
stated that, in using the null hypothesis, one supposes that the
two samples have been drawn from the same universe, but this
may seem scarcely realistic without some qualification or amplification.
We might have two samples which come from populations
obviously different in certain observable characteristics.
What is really meant is that the samples have been drawn from
the same universe of values for the trait under consideration.
Either way we regard this, it is seen that an estimate of just one
universe variance is needed. Calling this estimate $s^2$, by analogy
with the CR technique we have

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{N_1} + \dfrac{s^2}{N_2}}} \qquad (91)$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s\sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}} \qquad (91a)$$

The radical term is sometimes written in the equivalent form,
$\sqrt{(N_1 + N_2)/N_1N_2}$.

The best estimate of $\tilde{\sigma}^2$ is obtained by computing the sum of
squares separately for the two samples, then combining the sums,
and dividing by the proper df, or

$$s^2 = \frac{\Sigma(X_1 - \bar{X}_1)^2 + \Sigma(X_2 - \bar{X}_2)^2}{N_1 + N_2 - 2}$$

The two separate sums may be computed by either formula (90)
or (90a). Note that 2 degrees of freedom are lost because the sum
of squares is about 2 means, which leads to 2 restrictions. Once
$s^2$ has been calculated, either (91) or (91a) can be used to obtain t.
Formula (91), which involves doing only 1 square root, is preferable
unless $s$ is needed as a descriptive measure of dispersion.
When t has been determined, one turns to the table of t with df
or n equal to $N_1 + N_2 - 2$ in order to see whether it reaches a
chosen level of significance, say the point corresponding to P = .01.
If df happens to be greater than 30, t is interpreted as a CR; i.e.,
the normal table is used. The ultimate interpretation depends
upon the P value and hence is no different from the interpretation
placed on a P arrived at by either the CR or the $\chi^2$ technique as a
test of the significance of the difference between sample values.
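
The steps just described (separate sums of squares, a single pooled variance estimate, then t) can be sketched in Python as follows. The function names and the two tiny groups of scores are hypothetical, chosen only so that the arithmetic is easy to follow by hand; the probability must still be read from the table of t.

```python
import math


def sum_of_squares(scores):
    """Sum of squares about the group mean, from gross scores."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n


def pooled_t(group1, group2):
    """Return (t, df) for the difference between two uncorrelated means."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = sum(group1) / n1, sum(group2) / n2
    df = n1 + n2 - 2                      # 2 means fitted, 2 restrictions
    s2 = (sum_of_squares(group1) + sum_of_squares(group2)) / df
    t = (mean1 - mean2) / math.sqrt(s2 / n1 + s2 / n2)
    return t, df


# Hypothetical scores for two small groups.
t, df = pooled_t([2, 4, 6], [1, 2, 3])
print(round(t, 4), df)
```

Here the sums of squares are 8 and 2, so the pooled estimate is 10/4 = 2.5 with 4 df.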

There is one point in the method of estimating $\tilde{\sigma}^2$ for testing the
significance of the difference between means which may have
puzzled the student. The setting of the null hypothesis implies
that the two samples have been drawn from a single universe or
from two universes which have the same mean and equal variances.
It might accordingly be assumed that the best estimate of the
population variance would be obtained by taking the sum of
squares about the combined mean rather than about the separate
means. The former would give a better estimate of the variance
if it were really known that the two universe means were the same
(or that only one universe was involved), but there is always the
possibility that the two universe means really differ; and if this
were true, the taking of the sum of squares about the combined
mean would, in general, yield too large an $s^2$ (for reasons which
the student should figure out as an exercise). It follows, therefore,
that in the long run the best estimate of $\tilde{\sigma}^2$ will be provided by
summing the sums of the squares about the two means. This
points up another assumption, additional to the assumption of
trait normality, namely, that the two universes even if different as
to means are assumed to have the same variance.

It should be noted that, when information is available only for 
small samples, it becomes exceedingly difficult to be sure of the 
assumption of trait normality. Tests for normality are not sensi¬ 
tive enough to lead one to reject, on the basis of a small sample, 
the hypothesis of normality unless the departure therefrom is 
very marked. Likewise, the as yet undiscussed test for a possible 
difference between variances is too insensitive when used with 
small samples to lead to rejection of the hypothesis of equal vari¬ 
ance unless the difference between the two universe variances is 
sizable. The foregoing statements are, of course, based on the 
proposition that by statistical methods one can prove, at a de¬ 
sired level of significance, that a sample distribution did not arise 
from a normally distributed universe or that two universe values 
are different, but such methods will not prove normality nor prove 
that no difference exists between universe values.

DIFFERENCE BETWEEN CORRELATED MEANS 

If the two means to be compared are based on the same indi¬ 
viduals or on matched cases, the test of significance must make 
allowance for the fact that the two sets of scores are not random 
with respect to each other. In Chapter 5 we saw that this could 
be done by including the r term in the standard error of the differ¬ 
ence, as in formula (26), or by working directly with the differences 
between the paired scores. It will be recalled that $M_D = D_M$
and that $\sigma_{M_D} = \sigma_{D_M}$. When we have small samples, it is really
easier to work with $\sigma_D$, the sigma of the distribution of differences
between paired scores, and thence $\sigma_{M_D}$. In the notation being
used in this chapter we would have $D = X_2 - X_1$, and $\bar{D} = \bar{X}_2 - \bar{X}_1$.
To get the best estimate of the sampling error of $\bar{D}$, we would
need the sum of squares of the deviations of the pair differences
from the mean difference, i.e., $\Sigma(D - \bar{D})^2$, which when divided
by the proper df or N - 1, where N is the number of differences or
the number of paired scores, would give the best estimate of the
variance of the universe distribution of differences. Let $s_D^2$ stand
for this estimate. Then

$$s_{\bar{D}} = \frac{s_D}{\sqrt{N}} = \frac{\sqrt{\dfrac{\Sigma(D - \bar{D})^2}{N - 1}}}{\sqrt{N}} \qquad (92)$$

would be the best estimate of the sampling error of $\bar{D}$.

The computation is so straightforward as to render needless a
numerical example. Each of the D's is the difference between two
scores, the subtraction being made in the same direction for all,
and the sum of squares, $\Sigma(D - \bar{D})^2$, is obtained by using formula
(90) or (90a) with the D's treated as X's; i.e., the D's are summed
(needed to get the mean), and their squares are summed. After
$s_{\bar{D}}$ has been calculated, we get t as $\bar{D}/s_{\bar{D}}$. The hypothesis to be
tested is that the universe value of $\bar{D}$ is zero; the table of t is entered
with the obtained t and with df = N - 1 in order to determine
the probability of obtaining by chance as large a $\bar{D}$ as that
observed.
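
For readers who prefer to see the steps written out, the procedure for correlated means can be sketched in Python. The function name `paired_t` and the five pairs of scores are hypothetical assumptions of ours, not data from the text.

```python
import math


def paired_t(first, second):
    """Return (t, df) for the mean of the pair differences D = X2 - X1."""
    diffs = [x2 - x1 for x1, x2 in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Formula (90) with the D's treated as X's.
    ss_d = sum(d * d for d in diffs) - sum(diffs) ** 2 / n
    s_d = math.sqrt(ss_d / (n - 1))       # estimated sigma of differences
    se_mean_d = s_d / math.sqrt(n)        # formula (92)
    return mean_d / se_mean_d, n - 1


# Hypothetical paired scores for five individuals.
x1 = [10, 12, 9, 14, 11]
x2 = [12, 15, 10, 15, 14]
t, df = paired_t(x1, x2)
print(round(t, 4), df)
```

With these scores the differences are 2, 3, 1, 1, 3, so the mean difference is 2 and the sum of squares is 4, giving df = 4.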


SIGNIFICANCE OF CORRELATION, SMALL SAMPLES 

It can be shown that, if the correlation coefficient is computed 
for successive samples drawn from a population for which the 
correlation is zero, the successive values of 

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

will follow the t distribution with df = N - 2. If a sample t
reaches the .01 level of significance, one would conclude that it is 
not a chance deviation from zero, or that some correlation exists 
between the two variables involved. 



From the foregoing expression, it would appear that the t for 
testing the significance of correlation is nothing more than an 
$r/s_r$, with $s_r = \sqrt{(1 - r^2)/(N - 2)}$ as an estimate of the sampling
error of r. However, there are subtle mathematical reasons why 
such an interpretation is not permissible. 

The student may wonder why the df is taken as N - 2. Actually,
when we test the significance of an r, we are testing the
significance of regression. If r is zero, the regression is zero in the
sense that the regression coefficient or slope of the regression line
is zero. Now a linear regression line involves 2 constants, its
slope and its intercept; hence 2 degrees of freedom are lost in
fitting the line. Suppose N = 2, and that the two X scores differ;
likewise, the two Y scores. Imagine these pairs of scores plotted
in a scatter diagram, and a regression line fitted or a correlation
coefficient computed. The regression line would go through both
plotted points; therefore for the sample of 2 cases the prediction
would be perfect and r would be unity. The student may, as
an exercise, prove algebraically that, when N = 2 and when there
is variation in both X and Y, the correlation must be +1 or -1.
In other words, with N = 2 there is no freedom for sampling
variation in the numerical value of r.
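
Both points, the t for testing r and the absence of freedom when N = 2, can be illustrated numerically. The Python sketch below is not a proof; the function names and the sample values (r = .60 with N = 18, and the two pairs of scores) are hypothetical choices of ours.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """Return (t, df) for testing an r against a universe value of zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r), n - 2


# A hypothetical r of .60 based on 18 cases.
t, df = t_for_r(0.60, 18)
print(round(t, 3), df)

# With N = 2 and variation in both variables, r can only be +1 or -1.
print(pearson_r([1, 4], [3, 7]), pearson_r([1, 4], [7, 3]))
```

Note that `t_for_r` cannot be applied when N = 2: the df is zero and r squared is 1, so the expression is undefined, which agrees with the remark that no sampling variation is possible.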

The partial correlation coefficient based on a small sample can 
also be tested for significance by the t technique. If one variable 
has been eliminated, we have

$$t = \frac{r_{12.3}\sqrt{N - 3}}{\sqrt{1 - r_{12.3}^2}}$$

with df = N - 3. An additional degree of freedom is lost for
each additional variable eliminated. A test of the significance of 
multiple correlation will be given later. 

Although the t technique is preferable to the r to z transforma¬ 
tion discussed in Chapter 8, the latter entails very little error in 
testing the significance of an obtained correlation even for N 
small. If one wishes to test the deviation of an r from some a priori 
value or to establish confidence limits for the universe r or to test 
the significance of the difference between r's, the r to z transformation
is used for both large and small samples.





COMPARISON OF STANDARD DEVIATIONS AND OF VARIANCES 

For samples with N's greater than 100, the standard error of a
standard deviation can be obtained by formula (20) and used to
determine the standard error of the difference between two standard
deviations. Then the obtained difference divided by this
$\sigma_D$ gives a CR which is interpretable by means of the normal table.
The sampling distribution of the CR of the difference between
$\sigma$'s will not be normal when small samples are involved. We have
already pointed out that the sampling distribution of $\sigma$ or the
corresponding $s$ is skewed when N is small. Despite this skewness,
the sampling distribution of the difference between two $s$'s for
small samples of $N_1$ and $N_2$ cases drawn from the same universe
(or two universes with the same standard deviation) will be
approximately normal if $N_1 = N_2$, but even for this special case
the ratio of the difference to its standard error as estimated from
the samples will not have a normal sampling distribution because
the variable denominator part of the ratio is skewed. When $N_1$
does not equal $N_2$, the numerator or difference part is not normal
in sampling distribution, so it is readily seen that the testing of
the significance of the difference between $\sigma$'s or $s$'s involves more
complications than the comparison of means.

Rather than deal with the differences between $s$'s, it has been
found more convenient mathematically to deal with the difference
between their logarithms. Professor R. A. Fisher has developed
the mathematics of the sampling distribution of a function widely
known as z, which is defined as

$$z = \log_e s_1 - \log_e s_2 \qquad (93)$$

If successive samples are drawn from a single universe or from two
universes having the same variance, the sampling variation of z
will center at zero and depend upon $n_1$ and $n_2$, the two df's. Note
that the sampling distribution is independent of the universe value
of the variance or standard deviation. In other words, we do not
require an estimate of a standard error which uses information
from the samples, as required for the standard error of the difference
between $\sigma$'s. Probability tables for the z function are
available by which one can, for given df's, i.e., $n_1$ and $n_2$, find
how large z must be for the .05, the .01, and the .001 levels of
significance.




The z, defined by formula (93), has one disadvantage: logarithms
must be used. Since (93) can be written in the equivalent
form

$$z = \frac{1}{2}\log_e\frac{s_1^2}{s_2^2} \qquad (93a)$$

it is seen that, instead of the difference between two logarithms,
we have z as a function of the ratio of the two estimated variances.
From the sampling distribution of one-half the log of a ratio, the
sampling distribution of the ratio itself can be inferred. For
$n_1 = 5$ and $n_2 = 16$, the value of z which will be exceeded 1
per cent of the time by chance (the .01 probability level) is .7450.
This is one-half the log of the ratio of the two variances, and hence
the log of the ratio would be 1.4900; by reference to a table of
natural logarithms the antilog of 1.4900 is found to be 4.44. That
is, as large a ratio as 4.44 would occur .01 of the time by chance. In
order to avoid the necessity of using logs, Professor George W.
Snedecor has developed tables for the variance ratio, which is
defined as

$$F = \frac{s_1^2}{s_2^2} \qquad (94)$$

The equation† of the sampling distribution of F contains two
n's: $n_1$ for the df upon which $s_1^2$ is based, and $n_2$ as the df for $s_2^2$.
This means that there is a sampling distribution curve of F for
each possible combination of $n_1$ and $n_2$. The probability table
for F must accordingly be entered with $n_1$ and $n_2$ in order to learn
what level of significance a given F reaches. To use Table F of
the Appendix, we take the larger of the two variance estimates
as the numerator in computing F, and the df for this larger estimate
is symbolized as $n_1$ regardless of any system of subscripts
that may have been used to designate the two groups. Thus the
F that is used with the table is always unity or greater, even though
the sampling distribution of F involves values less than unity.
That is, if we were drawing successive samples from groups A
and B and each time took F as $s_A^2/s_B^2$ regardless of which was the
larger estimate, the sampling distribution of F would obviously
involve values below unity as well as above unity. The table,
however, is set up in terms of the greater-than-unity side of the
sampling distribution.

† $y \propto \dfrac{F^{(n_1 - 2)/2}}{(n_1F + n_2)^{(n_1 + n_2)/2}}$
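
The relation between Fisher's z and the variance ratio F can be verified with a short Python sketch. The function names are ours; the value .7450 is the one quoted in the text for 5 and 16 df, and the pair of variance estimates, 25 and 10, is used only to show the symmetry of z about zero.

```python
import math


def f_from_z(z):
    """F is the antilog of twice z, since z = (1/2) log_e F."""
    return math.exp(2 * z)


def z_from_variances(s1_sq, s2_sq):
    """Formula (93a): one-half the natural log of the variance ratio."""
    return 0.5 * math.log(s1_sq / s2_sq)


z_01 = 0.7450                       # quoted in the text for 5 and 16 df
print(round(f_from_z(z_01), 2))     # the ratio reached 1 per cent of the time

# Reversing numerator and denominator changes only the sign of z,
# whereas F moves from above unity to below unity.
print(z_from_variances(25, 10), z_from_variances(10, 25))
```

The symmetry of z about zero is the counterpart of the two-sided character of F when either estimate may be placed in the numerator.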

If one wishes to judge whether two samples, either large or 
small, yield a difference in variability which is large enough to 
warrant concluding that the two population variabilities differ, 
he sets up the null hypothesis that no difference exists in the two 
population variances. Then, instead of dealing as usual with the 
difference between the two estimates, he takes their ratio. Obvi¬ 
ously, the departure of this ratio or F from unity reflects or de¬ 
pends upon the difference between the two variance estimates. 
If the value of F, computed with the larger estimate in the numera¬ 
tor, is so large that it is not reasonable to believe it a chance devia¬ 
tion from a true value of unity, the null hypothesis is rejected, 
and it is concluded that the two populations do not have the same 
variance. If F is small, i.e., near unity, the null hypothesis is 
accepted. 

Now it happens that, although the F values given in Table F 
for the .05, the .01, and the .001 levels of significance hold for the 
major and very extensive uses of the F table to be discussed in 
Chapters 13, 14, and 15, these values are not applicable to the 
simple case where we wish to test the significance of the difference 
between the variabilities (variance estimates) for two groups. 
For this particular case, an F which falls at, say, the .01 level 
signifies that as large a difference in one direction would occur 
1 per cent of the time by chance. This is so because in placing the 
larger estimate in the numerator we are considering only one tail 
of the F distribution. In asking whether two variance estimates 
of, say, 10 and 25 based on two groups differ, i.e., lead to an F 
which departs significantly from unity (no difference), we should 
consider not only the probability of securing an F as large as 
25/10 but also the probability of obtaining one as small as 10/25. 
This, it will be observed, is exactly analogous to considering both 
positive and negative values for the z of formulas (93) and then 
raising the question as to the probability of obtaining on a chance 
basis as large a difference, irrespective of direction. If we had this 
last probability, we would halve it to obtain the P for one direction 
only; conversely, if we had an F which fell at the P = .01 level
in the table, we would need to double .01 to secure the probability
for as large a difference irrespective of direction. In other words,
for this particular case, that of testing the significance of the
difference between the variabilities of two groups, an F at the .01 point of the table
means significance at the .02 level; an F at the .05 level means 
significance at the .10 level; and an F at the .001 level indicates 
significance at the .002 level. We will not have to make this type 
of adjustment when we come to the principal uses of F in connec¬ 
tion with the analysis of variance. 

For example, suppose that 50.21 and 147.62 are variance estimates
available for two samples of 8 and 9 cases respectively.
The respective df's would be 7 and 8. In computing F we have
147.62/50.21 = 2.94, and $n_1$ becomes 8, with $n_2 = 7$. Turning
to Table F, we see that F would need to be 3.73 for the .05 level,
which for this type of problem is the .10 level. Therefore the null
hypothesis is not rejected. If we take the square roots of the two
variance estimates, we get $s$'s of 7.09 and 12.15. By the F test,
we are in effect saying that the difference between these two $s$'s
is not significant. As usual, this does not prove the null hypothesis;
it becomes acceptable because we cannot with sufficient
certainty reject it.
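
The bookkeeping of this example (larger estimate in the numerator, its df taken as $n_1$, tabled level doubled) can be sketched in Python. The function name `variance_ratio` is our own, and the critical value 3.73 is copied from the text rather than computed.

```python
def variance_ratio(est_a, df_a, est_b, df_b):
    """Return (F, n1, n2) with the larger variance estimate in the numerator."""
    if est_a >= est_b:
        return est_a / est_b, df_a, df_b
    return est_b / est_a, df_b, df_a


# Variance estimates from samples of 8 and 9 cases (df of 7 and 8).
F, n1, n2 = variance_ratio(50.21, 7, 147.62, 8)

F_TABLED_05 = 3.73                 # Table F value quoted in the text for n1 = 8, n2 = 7
two_sided_level = 2 * 0.05         # one tail of F tabled, so the level is doubled
reject_null = F >= F_TABLED_05

print(round(F, 2), n1, n2)
print("level for this problem:", two_sided_level)
print("reject null hypothesis:", reject_null)
```

The doubling applies only to this comparison of two variance estimates, not to the uses of F in the analysis of variance.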

There are two reasons why one may wish to know whether two 
supposedly different groups or populations differ as to variability 
for a given trait. The interest may be in the variability per se 
because possible differences may have scientific or practical sig¬ 
nificance. The much-bandied question as to sex difference in 
variational tendency would be an example of this. The second 
reason is purely statistical and is closely associated with the ques¬ 
tion (next to be discussed) of when the t technique is applicable. 


LIMITATIONS OF t FOR DIFFERENCE BETWEEN MEANS

As previously mentioned, one of the assumptions underlying 
the use of t in comparing means is that the two samples have 
been drawn from universes having the same variance. That this 
assumption is not tenable for a given batch of data will have been 
shown when the variance ratio, or F for the two estimates of 
variance, is sufficiently large to fall at or beyond the .01 level of 
significance, or beyond the .05 level if one chooses this less stringent
level. Suppose, however, that F does not reach the .01 or
.05 level; does it follow that the necessary condition of equal
population variance has been demonstrated? Obviously not.
Take, for example, the variance estimates discussed above. We 
cannot conclude that the two population values differ, nor can we 
conclude that they are the same. One population variance could 
easily be as small as 40, the other as large as 200. Even if the 
two estimates happened to be the same, the two universe values 
could be quite different. 

Thus it is doubtful whether it can be demonstrated by small 
samples that the possible difference between two universe vari¬ 
ances is small enough not to invalidate the use of t in testing the 
significance of the difference between means. When this is cou¬ 
pled with the fact that it is likewise difficult on the basis of small 
samples to be sure that the requisite assumption of trait nor¬ 
mality holds, we see the necessity for extreme caution in drawing 
conclusions from small samples regarding difference between 
means. 

Suppose that in one study the difference between means for 
two small samples leads to a t which falls at the .01 level and that
in another study two large samples yield means, for another trait, 
which are also significantly different at the .01 level. Can we 
place as much reliance on the first difference as on the second? 
The answer is yes, providing the two studies have been carried 
out with the same degree of care as regards controls and the use 
of adequate sampling techniques, and providing it is safe to presume
that the two fundamental assumptions underlying t are
tenable. Thus our confidence in a result based on small samples 
is a function not only of the probability level of significance at¬ 
tained but also of our faith that the two basic conditions have 
been met. Since, as we have seen, the conditions of trait nor¬ 
mality and of homogeneity of variances are exceedingly difficult 
to demonstrate when the only information available is based on 
the small samples at hand, we are forced to conclude that, in general,
we cannot place as much reliance on the results from small
samples as on those from large samples. 

This raises the question of the place of small samples in psycho¬ 
logical research, and about this there will be a diversity of opinion. 
We do not propose to settle the issue or even debate it; instead, 
we shall mention a few points which we feel are pertinent. There 
are, of course, types of research for which it is impossible or practically
impossible to secure more than a few cases either because
of their scarcity or because of prohibitive costs. For such situa¬ 
tions it is fortunate that the small sample or t technique, which 
permits some allowance for the smallness of the sample or samples, 
is available. Quite frequently small samples may be useful in a 
preliminary study which is carried out solely for the purpose of 
guiding the experimenter. If given hypotheses seem to be verified, 
then the next step should be to secure more cases for further 
verification rather than to rush into print with positive conclu¬ 
sions. 

It seems to us that those who publish statistical results based 
on a small number of cases should, unless they are positively sure 
that the basic assumptions underlying t have been met (and this 
assurance can seldom be attained), adopt a more stringent level 
of significance, say a P of .001 before drawing definite conclusions, 
and a P of .01 as the borderline of significance, i.e., as suggestive. 
Admittedly, a more stringent criterion of significance means that 
the null hypothesis may be less frequently rejected and conse¬ 
quently that a real difference may be overlooked. This general 
point has already been discussed in Chapter 5. Here we are 
arguing for a higher level (smaller P) for judging significance 
mainly because we need a higher significance level to overcome 
partially the loss of confidence entailed by ignorance of whether 
the requisite assumptions have been met. 

Aside from a needed safeguard against erroneously concluding 
from small samples that something is statistically significant, 
there is the additional fact that the larger sampling errors accom¬ 
panying small samples are not conducive to rejection of the null 
hypothesis unless the difference between the universe values is 
sizable. Let us suppose that the means for the heights of two 
populations are 64.5 and 68.0 and that the universe standard 
deviations are both equal to 2.7. An investigator who does not 
know these facts draws a random sample of 8 cases from each 
universe; and in order to help him a little (and also simplify this 
discussion), we tell him that each $\tilde{\sigma} = 2.7$. The standard error
of the difference between means becomes $2.7\sqrt{1/8 + 1/8}$ or 1.35.
If the investigator accepts the .01 level of significance, it is immediately
apparent that an obtained difference would have to be at
least (2.58)(1.35), or 3.48, for him to reject the null hypothesis.
(Why are we justified in using the normal deviate, 2.58, with such
small samples?) A little consideration of the fact that the sampling
distribution of differences between means will center at
3.5 indicates that the chances are nearly 50-50 that the investi¬ 
gator will be accepting the null hypothesis even though the real 
difference is more than a standard deviation in magnitude. 
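The "nearly 50-50" figure can be checked numerically. The following is a minimal sketch, assuming (as the text does) that the universe standard deviation is known so that the normal curve applies; the variable names are illustrative only and do not come from the text.

```python
from statistics import NormalDist

sigma = 2.7                   # universe standard deviation, given to the investigator
n = 8                         # cases drawn from each universe
true_diff = 68.0 - 64.5       # real difference between the universe means

# Standard error of the difference between two independent means.
se_diff = sigma * (1 / n + 1 / n) ** 0.5

# Smallest obtained difference that reaches the .01 level (two-sided).
critical_diff = 2.58 * se_diff

# Obtained differences are distributed normally about the true difference.
sampling_dist = NormalDist(mu=true_diff, sigma=se_diff)

# Chance that the obtained difference falls short of the critical value
# in absolute size, i.e., that the null hypothesis is accepted.
p_accept_null = sampling_dist.cdf(critical_diff) - sampling_dist.cdf(-critical_diff)

print(round(se_diff, 2), round(critical_diff, 2), round(p_accept_null, 3))
```

Under these assumptions the chance of accepting the null hypothesis comes out just under one-half, in line with the statement above.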

Our proposal that a higher level of significance be used in drawing
conclusions from small samples would obviously increase the
chances for accepting the null hypothesis. In the last analysis, a
balancing of risks is involved, the risk of being in error when concluding
that a difference exists vs. the risk of overlooking a real
difference. Which risk one prefers may depend upon the foreseeable
consequences of proclaiming a conclusion which may be
in error. If significance is claimed, then the result might become
the basis for new hypotheses or be seized upon as supporting some
theory or be used in a social action program. It cannot be expected
that a given conclusion will either receive verification or
be found wanting by some other investigator for the simple reason
that psychologists and other social scientists are notoriously
disinclined to make exact repetitive studies. If the original
hypothesis is highly reasonable, yet does not receive support
because t is not sufficiently large for rejecting the statistical or
null hypothesis, it might be argued that the study is more likely
to be repeated than when t is large enough to lead to “significance.”

There are times when an investigator may be so anxious to 
accept the null hypothesis that he will seize upon a very high level 
of significance in order to better his chances for accepting the 
hypothesis of no difference. Another way, as we have seen in 
the next to the last paragraph, for increasing the odds in favor of 
not rejecting the null hypothesis is to use exceedingly small samples.
Now those who desire to prove that no difference exists
must face the simple fact that such a proposition can never be
proved on a sampling basis. The most convincing way to demonstrate
that a difference is of no practical or scientific significance
is to use large samples and the confidence interval method for 
specifying limits for the population difference. 



CHAPTER 13 


Analysis of Variance: Simple


The F or variance ratio defined in the previous chapter is applicable
in a wide variety of situations. The general requirement is
that we have two independent estimates of variance, which estimates
are, on the basis of the null hypothesis, regarded as estimates
of the same population value. If F is sufficiently large,
the null hypothesis becomes suspect, and thereby one draws a
positive conclusion the nature of which depends upon the given
situation. It is assumed that the trait or variable, in terms of the
measurement units being employed, is normally distributed, but
there is some evidence that moderate skewness is permissible.

It will be recalled that under certain circumstances the correlation
coefficient is interpretable in terms of the proportion of
variance “explained.” The idea is that variation can be broken
down into component parts in such a way as to permit specification
of the relative importance of the component sources. Back
of this is the fact that variances are additive to a total variance,
as shown when we derived formulas (37) and (37a), which are
basic to the so-called variance theorem. Although this theorem
is fundamental to the analysis of variance technique, it is not our
aim to consider methods of estimating the proportion or percentage
of variance due to a given source but rather to discuss ways
of testing whether a possible source is contributing to the total
variance to a statistically significant degree.

BREAKDOWN OF SUM OF SQUARES 

Let us begin with the simple situation in which the total variation
for a set of scores based on $N$ individuals is possibly due in
part to the fact that the total group is heterogeneous with respect
to some factor, such as socioeconomic level or age or racial origin
or type of treatment or method used in memorizing or varying
level of illumination; that is, any factor which permits breaking down
the total group into subgroups. In other words, the individuals
or their scores can be classified into subgroups, or the total group
can be regarded as made up of specified subgroups. For simplicity,
let us assume that the subgroups are of the same size, say $m$ cases
per group, and that we have $k$ groups. Let $r$ stand for any subgroup;
i.e., $r$ takes on values of $1, 2, 3, \ldots, k$, and let the mean
score for the groups be specified as $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_r, \ldots, \bar{X}_k$, with
$\bar{X}$ as the mean for all groups combined (total mean). Although
it is possible to use a precise notation, such as a doubly subscripted $X$ to denote the
score of any, the $i$th, person in group $r$, we shall in this chapter
simply use $X$ as the score for any individual.

We are now in a position to write an individual's score as a
deviation from the total mean in terms of the deviation of his
score from his group mean and the deviation of the group mean
from the total mean. Thus, for a score in group $r$,

$$(X - \bar{X}) = (X - \bar{X}_r) + (\bar{X}_r - \bar{X}) \tag{95}$$

which indicates two sources of variation: the variation of a group
mean from the total mean and the variation of an individual's
score from his group mean. To see how the above equation or
identity works, take the score of 16 in group 2 of Table 38. We
would have

(16 - 10.0) = (16 - 13.0) + (13.0 - 10.0)
or
6 = 3 + 3

Similarly, in group 5 the score of 7 would lead to deviation values
as follows:

(7 - 10.0) = (7 - 9.5) + (9.5 - 10.0)
or
-3 = -2.5 + (-.5)

The score of 12 in group 2 would be expressed as

(12 - 10.0) = (12 - 13.0) + (13.0 - 10.0)
or
2 = -1 + 3

[Table 38. Frequency Distributions to Illustrate the Expressing of Scores in Terms of Deviations]

The scores are stated in deviation units so that we may specify
variances or estimates thereof in terms of the sum of squared
deviations. If we rewrite formula (95) specifically for group 1,
we have

$$(X - \bar{X}) = (X - \bar{X}_1) + (\bar{X}_1 - \bar{X})$$

Squaring both sides gives

$$(X - \bar{X})^2 = (X - \bar{X}_1)^2 + (\bar{X}_1 - \bar{X})^2 + 2(\bar{X}_1 - \bar{X})(X - \bar{X}_1)$$

as the squared deviation, from the total mean, of any score in
group 1. Each of the $m$ persons in the group will have such a
squared deviation score. We may indicate the sum of the squares
for the $m$ cases as

$$\sum(X - \bar{X})^2 = \sum(X - \bar{X}_1)^2 + \sum(\bar{X}_1 - \bar{X})^2 + 2(\bar{X}_1 - \bar{X})\sum(X - \bar{X}_1)$$

Note that in the last term the constants 2 and $(\bar{X}_1 - \bar{X})$ have
been taken from under the summation sign, and that $\sum(X - \bar{X}_1)$,
being the sum of deviations of a set of scores about their own
mean, will be exactly zero. Therefore, the last term vanishes.
Note also that the second right-hand term involves summing a
constant, which is the same as multiplying it by the number of
cases involved in the summation, i.e., $\sum(\bar{X}_1 - \bar{X})^2 = m(\bar{X}_1 - \bar{X})^2$.

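Thus we see that we may write the sum of squares (of deviations)
for the first group and by analogy for the other groups as
follows:

1st group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_1)^2 + m(\bar{X}_1 - \bar{X})^2$

2nd group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_2)^2 + m(\bar{X}_2 - \bar{X})^2$

$r$th group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_r)^2 + m(\bar{X}_r - \bar{X})^2$

$k$th group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_k)^2 + m(\bar{X}_k - \bar{X})^2$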

If we summed the left-hand parts of the foregoing, we would
obviously have the sum of squares of deviations for the entire set
of $N = km$ cases. This summing of sums, or double summation,
can be conveniently indicated by using two summation signs, or
$\sum\sum(X - \bar{X})^2$. We may sum the right-hand terms separately.
The first term on the right involves summing sums, and the result
can be indicated symbolically by $\sum_r\sum(X - \bar{X}_r)^2$, which implies
that we first sum for each group, then sum over all groups. The
first summation sign indicates that the subscript $r$ takes in turn
values running from 1 to $k$. The sum of the other right-hand
terms can be written as $m\sum_r(\bar{X}_r - \bar{X})^2$.

Since adding of equations leads to an equation, we have

$$\sum\sum(X - \bar{X})^2 = \sum_r\sum(X - \bar{X}_r)^2 + m\sum_r(\bar{X}_r - \bar{X})^2 \tag{96}$$

as a means of expressing the fact that the total sum of squares
(of deviations) can be broken down into two components, the
first of which has to do with variation about group means, i.e.,
within groups, and the second of which involves variation of group
means about the total mean, i.e., between groups. In other words,
the total sum of squares is made up of two additive parts. If we
divide both sides by $N$ or $km$, we have the total variance broken
into additive components; but for our present purposes we shall
need unbiased estimates of variance, and hence it becomes necessary
to divide through by degrees of freedom.
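The additive breakdown in formula (96) can be verified on any small set of scores. The sketch below uses hypothetical scores (three groups of four cases, not taken from the text) and computes each sum of squares directly from its definition.

```python
# Hypothetical scores: k = 3 groups of m = 4 cases each (not from the text).
groups = [[3, 5, 4, 6], [8, 7, 9, 10], [12, 11, 14, 13]]

k = len(groups)
m = len(groups[0])
all_scores = [x for g in groups for x in g]

total_mean = sum(all_scores) / len(all_scores)
group_means = [sum(g) / m for g in groups]

# Left side of (96): deviations of every score from the total mean.
ss_total = sum((x - total_mean) ** 2 for x in all_scores)

# First right-hand term: deviations of scores from their own group mean.
ss_within = sum(
    (x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g
)

# Second right-hand term: m times the squared deviations of group means.
ss_between = m * sum((gm - total_mean) ** 2 for gm in group_means)

print(ss_total, ss_within, ss_between)
```

For these scores the within and between sums add to the total sum, as the identity requires.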



The correct df can be ascertained by examining the three sums
of squares. For the total sum of squares we have one restriction,
the total mean, and as seen in the previous chapter the df will be
$N - 1$ or $km - 1$. The within-groups sum is based on $N$ or $km$
squares, but since these are about $k$ different means there are $k$
restrictions, or $km - k$ ($= N - k$) degrees of freedom. The last
or between-groups sum involves $k$ means, varying more or less
about the total mean; thus, aside from the $m$ factor, it contains
$k$ squares with one restriction, and the df becomes $k - 1$. In
other words, the $k$ means are analogous to varying scores, and
obviously the mean of these means will equal the total mean.

We may indicate the division of the three sums of squares by
the proper df's as follows:

$$\frac{\sum\sum(X - \bar{X})^2}{km - 1} \neq \frac{\sum_r\sum(X - \bar{X}_r)^2}{km - k} + \frac{m\sum_r(\bar{X}_r - \bar{X})^2}{k - 1}$$

Notice that we are no longer dealing with an equation. Why?
Each division will result in a variance estimate, but these are not
directly additive, which means that we cannot specify what proportion
of the estimated total variance is due to the between-groups
variation. The reader should note, however, that the
df's are additive: $(km - 1) = (km - k) + (k - 1)$.
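A short numerical sketch may make the point concrete. The sums of squares below are hypothetical values for k = 3 groups of m = 4 cases (they are not from the text); the function name is likewise only illustrative.

```python
def variance_estimates(ss_within, ss_between, k, m):
    """Divide each sum of squares by its own degrees of freedom."""
    df_total = k * m - 1
    df_within = k * m - k
    df_between = k - 1
    est_total = (ss_within + ss_between) / df_total
    est_within = ss_within / df_within
    est_between = ss_between / df_between
    return (df_total, df_within, df_between), (est_total, est_within, est_between)

# Hypothetical sums of squares for 3 groups of 4 cases.
dfs, estimates = variance_estimates(ss_within=15.0, ss_between=128.0, k=3, m=4)

df_total, df_within, df_between = dfs
est_total, est_within, est_between = estimates

print(dfs)        # the degrees of freedom add: 11 = 9 + 2
print(estimates)  # the variance estimates do not add
```

Here the total estimate is 13.0, whereas the within and between estimates sum to well over 60, so only the sums of squares and the degrees of freedom are additive.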

Before examining the meaning of these three variance estimates,
let us label them: $s^2$ for the estimate of total variance, $s_w^2$ for
that based on the within-groups sum of squares, $s_b^2$ for that based
upon between groups.

MEANING OF VARIANCE ESTIMATES 

In so far as one thinks of the total $km$ cases as a sample drawn
from one population, $s^2$ will be the best unbiased estimate of the
variance of the population, $\sigma^2$. If we think of the $m$ cases for
each of our $k$ groups as samples from $k$ possibly different populations,
then $s_w^2$ will be a composite estimate of the several population
variances, a sort of average which makes sense if the population
variances are equal; if the $k$ groups have been drawn from
just one population, this within-groups variance estimate or $s_w^2$
will differ little from, but be somewhat smaller than, $s^2$. Note
that $s^2$ and $s_w^2$ cannot be regarded as independent estimates because
the two estimates are based on practically the same deviations:
extreme scores, in either direction, will tend to make both
$s^2$ and $s_w^2$ large. If $m$, or the number of cases per group, is taken
larger and larger and if the groups are regarded as belonging to
the same population or populations differing in some respects but
having the same mean and variance for the given trait or variate,
$s^2$ and $s_w^2$ will tend to the same value, $\sigma^2$.

Let us next look at $s_b^2$. The division of $m\sum_r(\bar{X}_r - \bar{X})^2$ by its
df may be accomplished by dividing the sum factor by $k - 1$.
In making this division we are dividing a sum of squares by degrees
of freedom; hence the result will be a variance estimate. Let us
use $s_{\bar{X}}^2$ as a symbol for this estimate. Then

$$s_{\bar{X}}^2 = \frac{\sum_r(\bar{X}_r - \bar{X})^2}{k - 1}$$

In order to understand the meaning of $s_{\bar{X}}^2$ we may regard our $k$
means as a sample of sample means from an indefinitely large
supply of possible sample means for groups drawn from the same
population. The variance for this universe of sample means is
given by the standard error of mean formula, i.e., $\sigma_{\bar{X}}^2 = \sigma^2/m$.
If we were given the value of $\sigma_{\bar{X}}^2$ and told to determine the universe
trait variance or $\sigma^2$, we would simply solve $\sigma_{\bar{X}}^2 = \sigma^2/m$
for $\sigma^2$. Thus, $\sigma^2 = m\sigma_{\bar{X}}^2$. If we had only an estimate of $\sigma_{\bar{X}}^2$,
such as $s_{\bar{X}}^2$, we could use this estimate as a basis for estimating
the trait variance; i.e., $ms_{\bar{X}}^2$ can be taken as an estimate of $\sigma^2$.
Since $ms_{\bar{X}}^2 = s_b^2$, we have $s_b^2$ and $s_w^2$ (see previous paragraph)
as estimates of the same population variance.

These estimates should agree within the limits of chance, and
being independent estimates of the same variance, the sampling
distribution of their ratio is that of the F distribution. When an
obtained F or $s_b^2/s_w^2$ is larger than expected on the basis of chance
sampling, the implication is that $s_b^2$ is greater than expected by
chance. How could this come about? Let us suppose that our
$k$ groups of $m$ cases each have been drawn from $k$ different populations,
i.e., from populations with means which really differ. Under
this circumstance the variation of the $k$ sample means will spring
from two sources. A part of the variance of the means will be
due to sampling variation predictable by the formula for the
standard error of the mean on the basis of $m$ and the trait variance.
A second part of the variation in means will be due to the
variation of the true (population) means of the $k$ groups. If we
let $\sigma_{\bar{X}}^2$ represent the variance of obtained means and $\sigma_{\bar{X}_\infty}^2$ the
variance of the true group means, and if the several groups have
the same population variance, $\sigma^2$ (another assumption underlying
the variance technique), we should expect the following to
hold exactly for an infinitely large number of groups and approximately
for a small number of groups: $\sigma_{\bar{X}}^2 = \sigma^2/m + \sigma_{\bar{X}_\infty}^2$. This
is analogous to the commonly accepted expression used in connection
with test reliability, namely, that the variance of obtained
scores equals the variance of true scores plus error (of measurement)
variance.

Multiplying the above by $m$, we have $m\sigma_{\bar{X}}^2 = \sigma^2 + m\sigma_{\bar{X}_\infty}^2$.
Thus, since $m$ times the obtained variance of group means can
be broken down into two components, it should be obvious that
the estimate, $s_b^2$, may also be subject to two sources of variation.

In practice we don't have a priori knowledge of whether $m\sigma_{\bar{X}_\infty}^2$
is real or zero. What we have are two estimates of the population
trait variance, that based on $s_b^2$ (or $ms_{\bar{X}}^2$) and that based on $s_w^2$.
If the $s_b^2$ estimate is significantly larger than $s_w^2$, i.e., if F or
$s_b^2/s_w^2$ is beyond the point for the P = .01 level of significance, it
can be argued that $s_b^2$ involves a source of variation over and
above that of random sampling errors in the means, and hence
that $m\sigma_{\bar{X}_\infty}^2$ is real. This is, of course, equivalent to concluding
that our $km$ cases have been drawn from $k$ groups with real differences
in their population means.
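The two sources of variation in the between-groups estimate can be illustrated by a small sampling experiment. The sketch below is hypothetical throughout: the true means, the trait standard deviation, the number of repetitions, and the function names are all assumptions made for illustration, and for a fixed set of k true means their variance is taken with divisor k - 1.

```python
import random

def between_within(groups):
    """Return the between-groups and within-groups variance estimates."""
    k = len(groups)
    m = len(groups[0])
    means = [sum(g) / m for g in groups]
    total_mean = sum(means) / k  # equal group sizes, so this is the total mean
    est_between = m * sum((gm - total_mean) ** 2 for gm in means) / (k - 1)
    est_within = sum(
        (x - gm) ** 2 for g, gm in zip(groups, means) for x in g
    ) / (k * m - k)
    return est_between, est_within

def average_estimates(true_means, sigma, m, reps, seed):
    """Average both estimates over many sets of random normal samples."""
    rng = random.Random(seed)
    sum_b = sum_w = 0.0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(m)] for mu in true_means]
        b, w = between_within(groups)
        sum_b += b
        sum_w += w
    return sum_b / reps, sum_w / reps

# Null case: five groups of 8 from one population with variance 4.
null_b, null_w = average_estimates([10.0] * 5, sigma=2.0, m=8, reps=4000, seed=1)

# Real differences: true means 8, 9, 10, 11, 12 (variance 2.5 with divisor k - 1).
alt_b, alt_w = average_estimates([8.0, 9.0, 10.0, 11.0, 12.0], sigma=2.0, m=8, reps=4000, seed=1)

print(round(null_b, 2), round(null_w, 2), round(alt_b, 2), round(alt_w, 2))
```

Under these assumptions both estimates average near the trait variance of 4 in the null case, while with real differences the between-groups estimate averages near 4 + 8(2.5) = 24 and the within-groups estimate stays near 4.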

Although the table of F requires that the larger of the two
estimates be used as the numerator in computing the variance
ratio, it should be noted that $s_w^2$ cannot be significantly larger
than $s_b^2$ unless the operation of chance sampling has been restricted
in some manner. In practical applications we are primarily and
nearly always interested in the case in which $s_b^2$ is the larger of
the two estimates. If it is smaller than $s_w^2$, it is ordinarily not
necessary to compute F.

We may now summarize the foregoing. When we have scores
on $k$ groups of $m$ cases each, the total sum of squares can be broken
down into two additive parts, that for between and that for within
groups. Dividing by the appropriate degrees of freedom, the
within sum of squares gives $s_w^2$ as an estimate of the trait variance
for the population, and $s_b^2$ ($= ms_{\bar{X}}^2$) yields a second and independent
estimate of the same population variance. The sampling
variation of the ratio of these two estimates is that of the variance
ratio, F, if the $k$ groups belong to the same population. If $s_b^2$ is
significantly larger than $s_w^2$, which is an estimate of the population
variance, $s_b^2$ must be regarded as an estimate of the same
variance plus variation due to real, nonchance, differences between
the $k$ groups.

If we let $\rightarrow$ stand for “is an estimate of,” then

$$s_b^2 \rightarrow \sigma^2 + m\sigma_{\bar{X}_\infty}^2$$

The null hypothesis is that $\sigma_{\bar{X}_\infty}^2$ is zero, and rejection of this hypothesis
because $s_b^2/s_w^2$ is significantly large implies that $\sigma_{\bar{X}_\infty}^2$ is
not zero, or that the $k$ groups have not been drawn from the same
population (or from populations with equal means). In other
words, we have a technique that provides an over-all test for the
significance of the differences between several means considered
simultaneously.

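Computational formulas. The required arithmetical labor
can be shortened by resort to the general principle for computing
the sum of squares of deviations inherent in formula (90) or (90a):

$$\sum(X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{N} = \frac{1}{N}\left[N\sum X^2 - \left(\sum X\right)^2\right]$$

Thus we would have

$$\sum\sum(X - \bar{X})^2 = \frac{1}{N}\left[N\sum\sum X^2 - \left(\sum\sum X\right)^2\right] \tag{97a}$$

for total sum of squares, in which the double summation indicates
that the summing is over all groups. It can be shown by easy
algebra that

$$\sum\sum(X - \bar{X}_r)^2 = \frac{1}{m}\left[m\sum\sum X^2 - \sum_r\left(\sum X\right)^2\right] \tag{97b}$$

for within sum of squares and that

$$m\sum_r(\bar{X}_r - \bar{X})^2 = \frac{1}{km}\left[k\sum_r\left(\sum X\right)^2 - \left(\sum\sum X\right)^2\right] \tag{97c}$$

for between sum of squares.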



Accordingly, to compute the three sums of squares of deviations,
we need to sum all the raw scores, $\sum\sum X$; sum the squares
of all the raw scores, $\sum\sum X^2$; and sum the squares of the separate
group sums, $\sum(\sum X)^2$. These sums can readily be obtained on a
calculating machine by computing $\sum X$ and $\sum X^2$ separately for
each group, squaring each $\sum X$, and then summing the several
$\sum X$ values for $\sum\sum X$, the $\sum X^2$ values for $\sum\sum X^2$, and the $(\sum X)^2$
values for $\sum(\sum X)^2$.
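The raw-score route just described might be sketched as follows. The function name and the small data set are hypothetical (not from the text); the three returned quantities follow formulas (97a), (97b), and (97c) for groups of equal size.

```python
def sums_of_squares_equal_m(groups):
    """Total, within, and between sums of squares from raw-score sums.

    Assumes every group has the same number of cases, m.
    """
    k = len(groups)
    m = len(groups[0])
    n = k * m

    sum_x = sum(sum(g) for g in groups)             # sum of all raw scores
    sum_x2 = sum(x * x for g in groups for x in g)  # sum of all squared scores
    sum_gs2 = sum(sum(g) ** 2 for g in groups)      # sum of squared group sums

    total = (n * sum_x2 - sum_x ** 2) / n              # (97a)
    within = (m * sum_x2 - sum_gs2) / m                # (97b)
    between = (k * sum_gs2 - sum_x ** 2) / (k * m)     # (97c)
    return total, within, between

# Hypothetical scores: 3 groups of 4 cases.
example = [[3, 5, 4, 6], [8, 7, 9, 10], [12, 11, 14, 13]]
total, within, between = sums_of_squares_equal_m(example)
print(total, within, between)
```

Only three running sums per group are needed, which is the saving in labor the computational formulas are meant to provide.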


EXAMPLE: TESTING THE SIGNIFICANCE OF DIFFERENCES 
BETWEEN SEVERAL MEANS 

To illustrate the application of the technique outlined above
we shall use unpublished data of Wright* on massed vs. distributed
practice in the learning of nonsense syllables by the
anticipation method. The essential comparison is based on the
amount of learning shown in 34 minutes by 5 ($= k$) groups of
16 ($= m$) cases each. The groups differed in length of rest intervals
between trials and/or in the total number of trials, as indicated
at the top of Table 39. The scores of all 80 subjects are
included in this table, and the necessary sums are given at the
bottom of the table, separately for each group. Summing across
yields the required double sums. The group means are also given,
although not actually needed in determining F.

Table 39. Number of Syllables Correctly Anticipated at the 34th
Minute of Practice

Group                         1        2        3        4        5
Rest interval (minutes)       8      3.5        2     1.25        0
Number of trials              5        8       11       14       29

Scores                        5        8        9       11       17
                              5        7        3       12       16
                              1        4        9       15       18
                              5        4       10       11       11
                              8        7        5       10       15
                              1        7       11        8        9
                              2        5        9       13       18
                              2        6        6       13       13
                              2        8        7        5       12
                              8       14        6        7       15
                              4        8       16       11        8
                              1        5       12       12       13
                              3        1       11       12        7
                              4        5       15        9       15
                              4        8       13       16       15
                              2        5        4        7       13

m                            16       16       16       16       16
$\sum X$                     57      102      146      172      215    $\sum\sum X$ = 692
$\sum X^2$                  279      768    1,550    1,982    3,059    $\sum\sum X^2$ = 7,638
$(\sum X)^2$              3,249   10,404   21,316   29,584   46,225    $\sum(\sum X)^2$ = 110,778
Means                      3.56     6.38     9.12    10.75    13.44    $\bar{X}$ = 8.65

* Wright, Suzanne T., Spacing of practice in verbal learning and the maturation
hypothesis, Unpublished Master's Thesis, Stanford University, California,
1946.

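The sums of squares (of deviations) are obtained by substituting
in formulas (97):

$$\sum\sum(X - \bar{X})^2 = \tfrac{1}{80}\left[80(7638) - (692)^2\right] = 1652.20$$

$$\sum\sum(X - \bar{X}_r)^2 = \tfrac{1}{16}\left[16(7638) - 110{,}778\right] = 714.38$$

$$m\sum(\bar{X}_r - \bar{X})^2 = \tfrac{1}{80}\left[5(110{,}778) - (692)^2\right] = 937.82$$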

These sums of squares, along with the respective degrees of 
freedom and the resulting variance estimates, are conveniently 
arranged in Table 40, usually referred to as a variance table. Note 
that the sums of squares for between and within groups add to 
the sum for the total, which provides a check on the arithmetic 
involved in substituting in formulas (97). This does not check 
on the accuracy of the sums given in Table 39. Note also that 
the degrees of freedom add to the total df. 


Table 40. Variance Table for Data of Wright

Source      Sum of Squares     df     Variance Estimate
Between          937.82         4     234.46 = $s_b^2$
Within           714.38        75       9.53 = $s_w^2$
Total           1652.20        79

The variance ratio, or F, becomes 234.46/9.53 or 24.60. With
df's of $n_1 = 4$ and $n_2 = 75$, we refer to the table of F to learn
whether 24.60 is larger than expected on the basis of chance.
That this F is highly significant is immediately apparent when
we note that for the given df's an F of about 5.2 is significant at
the .001 level. With the between-groups variance estimate significantly
larger than that for within groups, we can conclude with
high confidence that the five sets of scores have not been drawn
from the same population of scores, or that amount of time spent
in practice is a real source of variation. This is, of course, equivalent
to saying that the several group means considered simultaneously
differ significantly among themselves.
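The whole computation for the Wright data can be retraced in a few lines. This is a sketch rather than a prescribed routine: the scores are those of Table 39, while the variable names are illustrative. Carried without intermediate rounding, the ratio comes to about 24.61; the 24.60 quoted above results from dividing the rounded estimates 234.46 and 9.53.

```python
# Scores from Table 39, one list per group (groups 1 to 5).
groups = [
    [5, 5, 1, 5, 8, 1, 2, 2, 2, 8, 4, 1, 3, 4, 4, 2],
    [8, 7, 4, 4, 7, 7, 5, 6, 8, 14, 8, 5, 1, 5, 8, 5],
    [9, 3, 9, 10, 5, 11, 9, 6, 7, 6, 16, 12, 11, 15, 13, 4],
    [11, 12, 15, 11, 10, 8, 13, 13, 5, 7, 11, 12, 12, 9, 16, 7],
    [17, 16, 18, 11, 15, 9, 18, 13, 12, 15, 8, 13, 7, 15, 15, 13],
]

k = len(groups)       # 5 groups
m = len(groups[0])    # 16 cases per group
n = k * m             # 80 cases in all

sum_x = sum(sum(g) for g in groups)             # 692
sum_x2 = sum(x * x for g in groups for x in g)  # 7,638
sum_gs2 = sum(sum(g) ** 2 for g in groups)      # 110,778

# Sums of squares by formulas (97a), (97b), (97c).
ss_total = (n * sum_x2 - sum_x ** 2) / n
ss_within = (m * sum_x2 - sum_gs2) / m
ss_between = (k * sum_gs2 - sum_x ** 2) / (k * m)

# Variance estimates: each sum of squares over its degrees of freedom.
est_between = ss_between / (k - 1)
est_within = ss_within / (n - k)
f_ratio = est_between / est_within

print(ss_total, ss_within, ss_between)
print(round(est_between, 2), round(est_within, 2), round(f_ratio, 2))
```

The three sums of squares agree with Table 40, and the between and within sums add to the total, which is the arithmetic check mentioned in the text.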

In the illustration just given the groups can be arranged in 
order before any of the data are seen, and additional credence can 
be placed in the results because the means follow this ordering. 
It should be understood, however, that the variance technique 
does not presuppose an a priori ordering of the several groups— 
it is generally applicable for testing the significance of the differences
between group means regardless of prior considerations.

If one had available only the CR or t techniques and wished to
compare the means for 5 groups, it would ordinarily be necessary
to compute t or CR for each possible difference, and 5 means would
lead to 5 × 4/2 or 10 differences. Obviously, the variance method
requires less computation, and furthermore it provides an over-all
test of significance which is not subject to the fallacy inherent
in singling out the comparison involving the largest obtained t
or CR, a practice which is likely to capitalize on chance differences.
After and only after it has been found that the over-all F
is significant, can one safely use the t technique to test the significance
of the difference between any two of the group means.
When we do this, $s_w$ is used for the $s$ required in formula (91) or
(91a). Thus, to check the significance of the difference between
the means for groups 1 and 2 of the Wright data, we have

$$t = \frac{6.38 - 3.56}{\sqrt{9.53\left(\frac{1}{16} + \frac{1}{16}\right)}} = \frac{2.82}{1.09} = 2.59$$

The variance estimate here used is based on 75 degrees of freedom;
hence this t may be entered as a CR in the normal probability
table. It is significant at the .01 level. Since group 1 differs still
more from the remaining 3 groups, one would not bother to compute
additional t's for comparisons involving group 1. Actually
the testing of the means for nonadjacent groups would scarcely
be necessary, but note that, since the groups are of the same size,
the t between any 2 means in Table 39 will involve the same
denominator, 1.09, already used. The use of $s_w$ as the $s$ for the
t test is logical in that $s_w$ is based on all the available scores and
hence is more dependable than an estimate based on just 2 groups.
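The t for groups 1 and 2 can be recomputed from the figures quoted in the text (means 3.56 and 6.38, within-groups estimate 9.53, 16 cases per group). The sketch below assumes only those rounded figures; the function name is illustrative. Carried to more decimals the result is about 2.58, and the 2.59 in the text arises from rounding the denominator to 1.09, so the value sits very close to the 2.58 needed for the .01 level.

```python
def t_between_two_means(mean_a, mean_b, within_estimate, m_a, m_b):
    """t for two group means, using the within-groups variance estimate."""
    standard_error = (within_estimate * (1 / m_a + 1 / m_b)) ** 0.5
    return (mean_a - mean_b) / standard_error, standard_error

# Rounded figures as quoted in the text for groups 2 and 1 of the Wright data.
t_value, se_value = t_between_two_means(6.38, 3.56, 9.53, 16, 16)

print(round(se_value, 2), round(t_value, 2))
```

Because all groups have 16 cases, the same standard error serves for the t between any other pair of means in Table 39.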

SPECIAL CASE OF F TEST WHEN $n_1 = 1$

If we had $k = 2$ groups, the testing of the between-groups
variance would appear to be much like testing the difference
between 2 means. Let us examine this case by starting with the
expressions for the sum of squares for 2 groups:

1st group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_1)^2 + m(\bar{X}_1 - \bar{X})^2$

2nd group: $\sum(X - \bar{X})^2 = \sum(X - \bar{X}_2)^2 + m(\bar{X}_2 - \bar{X})^2$

Instead of using double summation signs, we may indicate the
within-groups sum of squares as $\sum(X - \bar{X}_1)^2 + \sum(X - \bar{X}_2)^2$,
and the between-groups sum of squares as $m(\bar{X}_1 - \bar{X})^2 + m(\bar{X}_2 - \bar{X})^2$.
The respective df's will be $2m - 2$ and 1. Indicating
the division of the sums of squares by their df's, we can
write the variance ratio as

$$F = \frac{\left[m(\bar{X}_1 - \bar{X})^2 + m(\bar{X}_2 - \bar{X})^2\right]/1}{\left[\sum(X - \bar{X}_1)^2 + \sum(X - \bar{X}_2)^2\right]/(2m - 2)}$$

Since the number of cases for the 2 groups is the same, it is readily
seen that the mean for one group will be exactly as far above the
general mean ($\bar{X}$) as the other group mean is below $\bar{X}$, or that
$\bar{X}$ will bisect the distance between $\bar{X}_1$ and $\bar{X}_2$; therefore
$(\bar{X}_1 - \bar{X})^2 = (\bar{X}_2 - \bar{X})^2 = \tfrac{1}{4}(\bar{X}_1 - \bar{X}_2)^2$. The numerator for F becomes
$(m/2)(\bar{X}_1 - \bar{X}_2)^2$. It will be noted that the denominator term,
which defines $s_w^2$, is identical to the $s^2$ defined on p. 224 in connection
with the t test. Accordingly, we may write

$$F = \frac{(m/2)(\bar{X}_1 - \bar{X}_2)^2}{s^2}$$

Dividing both numerator and denominator by $m/2$, we have

$$F = \frac{(\bar{X}_1 - \bar{X}_2)^2}{\dfrac{2s^2}{m}}$$

the square root of which is

$$\sqrt{F} = \frac{\bar{X}_1 - \bar{X}_2}{s\sqrt{\dfrac{2}{m}}}$$

which is identical with formula (91a) for t; when $k = 2$ or 2 groups
are being compared, $F = t^2$. It can be shown that this is also
true when the N's or m's for the 2 groups are unequal. In fact,
it can be shown that, when $n_1 = 1$, the sampling distribution of
F becomes the same as that for $t^2$, providing the estimate based
on between groups, i.e., that based on 1 degree of freedom, is
used as the numerator regardless of which of the 2 estimates is
the larger. It is thus seen that the t test is a special case of the
F test. Note that F involves the square of the difference between
means; hence it provides a basis for judging whether a difference between
means, irrespective of direction, is significant (cf. pp. 230-231).
The CR technique for comparing the means of 2 large samples
is also a special case of the more general F test. That is, when
$n_1 = 1$ and $n_2$ is not small, the square root of F is CR, interpretable
via the normal curve table (Table A of the Appendix).
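The equality of F and the square of t for two groups is easy to confirm numerically. In the sketch below the scores and function names are hypothetical (not from the text); t is computed with the usual pooled variance estimate and F from the between and within sums of squares, for groups of equal and of unequal size.

```python
def mean(scores):
    return sum(scores) / len(scores)

def pooled_t(group_1, group_2):
    """t for the difference between two means, pooled variance estimate."""
    n1, n2 = len(group_1), len(group_2)
    m1, m2 = mean(group_1), mean(group_2)
    ss = sum((x - m1) ** 2 for x in group_1) + sum((x - m2) ** 2 for x in group_2)
    s2 = ss / (n1 + n2 - 2)
    return (m1 - m2) / (s2 * (1 / n1 + 1 / n2)) ** 0.5

def variance_ratio(groups):
    """F as the between-groups estimate over the within-groups estimate."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    total_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - total_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical scores, equal and unequal group sizes.
a = [4, 6, 5, 7, 8]
b = [9, 7, 10, 8, 11]
c = [9, 7, 10, 8, 11, 12, 6]

t_equal, f_equal = pooled_t(a, b), variance_ratio([a, b])
t_unequal, f_unequal = pooled_t(a, c), variance_ratio([a, c])

print(t_equal, f_equal)
print(t_unequal ** 2, f_unequal)
```

For the equal-sized pair t is -3 and F is 9; the sign of t is lost in F, which is why F judges a difference irrespective of direction.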


GROUPS OF UNEQUAL SIZE 

When the number of cases varies from group to group, we may
let $m_1, m_2, \ldots, m_r, \ldots, m_k$ stand for the several N's. The sum
of squares for the $r$th group would be written as

$$\sum(X - \bar{X})^2 = \sum(X - \bar{X}_r)^2 + m_r(\bar{X}_r - \bar{X})^2$$

and the double summation over all groups would be

$$\sum\sum(X - \bar{X})^2 = \sum\sum(X - \bar{X}_r)^2 + \sum m_r(\bar{X}_r - \bar{X})^2$$

which differs from formula (96) in that the varying m's must be
left under the summation sign in the last term. In specifying
the degrees of freedom, we must replace $km$ by $N$, where $N$ is the
total cases for all groups. The respective df's become $N - 1$,
$N - k$, and $k - 1$. The computational formulas are changed to

$$\sum\sum(X - \bar{X})^2 = \sum\sum X^2 - \frac{(\sum\sum X)^2}{N} \quad \text{for total sum} \tag{98a}$$

$$\sum\sum(X - \bar{X}_r)^2 = \sum\sum X^2 - \sum_r\frac{(\sum X)^2}{m_r} \quad \text{for within sum} \tag{98b}$$

$$\sum_r m_r(\bar{X}_r - \bar{X})^2 = \sum_r\frac{(\sum X)^2}{m_r} - \frac{(\sum\sum X)^2}{N} \quad \text{for between sum} \tag{98c}$$

Note that the second term for the within sum (and the first for
the between) requires that for each group the square of the sum
of its scores be first divided by its $m$; then the several quotients
are summed. An additional row would be needed along the bottom
of Table 39 for these quotients if the m's differed, or one
might replace the $(\sum X)^2$ row by $(\sum X)^2/m_r$ values.
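A sketch of formulas (98a) to (98c) for groups of unequal size follows. The function name and the scores are hypothetical (not from the text); the results are checked against sums of squares computed directly from deviations.

```python
def sums_of_squares_unequal_m(groups):
    """Total, within, and between sums of squares for unequal group sizes."""
    n = sum(len(g) for g in groups)
    sum_x = sum(sum(g) for g in groups)
    sum_x2 = sum(x * x for g in groups for x in g)
    # For each group: square of its sum divided by its own m, then summed.
    sum_quotients = sum(sum(g) ** 2 / len(g) for g in groups)

    total = sum_x2 - sum_x ** 2 / n            # (98a)
    within = sum_x2 - sum_quotients            # (98b)
    between = sum_quotients - sum_x ** 2 / n   # (98c)
    return total, within, between

# Hypothetical scores: groups of 3, 5, and 2 cases.
example = [[2, 4, 3], [6, 8, 7, 9, 5], [10, 12]]
total, within, between = sums_of_squares_unequal_m(example)

# Direct computation from deviations, for comparison.
n = sum(len(g) for g in example)
total_mean = sum(sum(g) for g in example) / n
direct_within = sum((x - sum(g) / len(g)) ** 2 for g in example for x in g)
direct_between = sum(len(g) * (sum(g) / len(g) - total_mean) ** 2 for g in example)

print(total, within, between)
```

With N = 10 and k = 3 the degrees of freedom for these scores would be 9, 7, and 2.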

A variance table (like Table 40) may be formed, and F taken
to equal $s_b^2/s_w^2$ as before. The same interpretation holds: if F
is significantly large, i.e., if $s_b^2$ is significantly larger than $s_w^2$,
the variation of the several group means among themselves is
larger than expected on the basis of sampling; hence nonchance
differences exist between the groups. The student who attempts,
for the situation of unequal m's, to reorient the logic leading to
the idea that $s_w^2$ is an estimate of $\sigma^2$ and that $s_b^2$ is an estimate
of $\sigma^2$ plus a possible $m\sigma_{\bar{X}_\infty}^2$ will encounter some difficulty. Suffice
it to say here that, if $s_b^2$ is significantly larger than $s_w^2$, it can be
concluded that the component involving the variance $\sigma_{\bar{X}_\infty}^2$ is not
zero. That is, when the groups have been drawn from populations
having different means, $s_b^2$ may be larger than $s_w^2$ because
of this additional source of variation even though it is not easy to
regard this variation in terms of $\sigma_{\bar{X}_\infty}^2$ times a varying $m$.

Thus the F technique may be applied as a test of the significance
of the difference between two or more means based on large
or small samples of equal or unequal size (per group) regardless
of whether there is an a priori basis for arranging the groups in
order. It might be said parenthetically that the scientific hypothesis
being tested will specify the direction of differences if such are
expected. The two fundamental or underlying assumptions,
which accordingly restrict the usage of the variance technique,
are that the trait as measured be distributed normally in the
population (or populations) sampled and that the trait variance
be the same for the different groups. There is some evidence
that moderate departure from normality and moderate lack of
homogeneity regarding variances do not seriously disrupt the
applicability of the technique. Unfortunately, it is not easy to
give a definitive definition of “moderate,” nor is it easy with
small samples to demonstrate that the two assumptions are being
satisfactorily met.


TESTING THE SIGNIFICANCE OF THE CORRELATION RATIO 

If the definitions of the correlation ratio, $\eta$ (pp. 184-185), are
reexamined, it is readily seen that for one variable the within-arrays
variance is the same as the within-groups variance, the
grouping being made on the basis of intervals on another variable.
Also the variance of array means is the same as between-groups
variance. We recall, however, that the correlation ratio, as defined,
does not involve the idea of variance estimates. It should
be rather obvious that, unless the between-arrays (groups) variance
is significantly larger than expected on the basis of sampling
errors in the array means, a correlation ratio cannot be deemed
significant.

For purposes of exposition we shall outline the procedure for
testing the significance of $\eta_{yx}$, for which we shall use the simpler
symbol $\eta$. The grouping will be on the basis of the intervals on
the $X$ variable, and the required sum of squares will be in terms
of $Y$. The sums of squares and their respective degrees of freedom
will be

$$\underbrace{\sum\sum(Y - \bar{Y})^2}_{(N - 1)} = \underbrace{\sum\sum(Y - \bar{Y}_r)^2}_{(N - k)} + \underbrace{\sum m_r(\bar{Y}_r - \bar{Y})^2}_{(k - 1)}$$

for $k$ arrays with varying number, $m_r$, of cases per array. From
the definition formula of the correlation ratio, we have

$$\eta^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

which becomes, in the notation of this chapter,

$$\eta^2 = 1 - \frac{\sum\sum(Y - \bar{Y}_r)^2/N}{\sum\sum(Y - \bar{Y})^2/N}$$

Since $N$ cancels, we see that the following holds:

$$\sum\sum(Y - \bar{Y}_r)^2 = (1 - \eta^2)\sum\sum(Y - \bar{Y})^2 = \text{within sum of squares} \tag{99}$$

From the alternate expression for $\eta$, in which $\eta^2$ is the ratio of the
variance of the array means to the total variance, we have, in the
notation of this chapter,

$$\eta^2 = \frac{\sum m_r(\bar{Y}_r - \bar{Y})^2/N}{\sum\sum(Y - \bar{Y})^2/N}$$

which leads to

$$\sum m_r(\bar{Y}_r - \bar{Y})^2 = \eta^2\sum\sum(Y - \bar{Y})^2 = \text{between sum of squares} \tag{100}$$

When we wish to divide the sum of squares of formula (99) or
(100) by the proper df, we may choose either the left- or right-hand
part as representing the sum of squares. Thus the between-arrays
estimate may be written as

$$s_b^2 = \frac{\eta^2\sum\sum(Y - \bar{Y})^2}{k - 1}$$

and that for within arrays as

$$s_w^2 = \frac{(1 - \eta^2)\sum\sum(Y - \bar{Y})^2}{N - k}$$

The ratio, $F = s_b^2/s_w^2$, may be written as

$$F = \frac{\eta^2\sum\sum(Y - \bar{Y})^2/(k - 1)}{(1 - \eta^2)\sum\sum(Y - \bar{Y})^2/(N - k)} = \frac{\eta^2/(k - 1)}{(1 - \eta^2)/(N - k)}$$


It is accordingly seen that for fixed df's the value of F, even though
computed from the sums rather than from their equivalents in
terms of $\eta^2$, can be thought of as depending upon the size of $\eta^2$;
therefore a significant F indicates a significant correlation ratio.

With the three sums of squares computed, we can readily determine
whether any correlation in the sense of the correlation ratio
exists, and we also have the necessary sums for calculating $\eta$ if it
is desired to have this measure of the degree of correlation. A
significant F does not, however, mean a high correlation ratio;
with $N$ large, a low $\eta$ can possess statistical significance.

The computation of the sums of squares is accomplished by 
means of formulas (98) with the X's replaced by T's. 
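
The dependence of F on the correlation ratio can be sketched in a
few lines of Python. This is only an illustration: the function name
and the values used (a squared correlation ratio of .25, N = 50,
k = 5) are assumptions chosen for the example and do not come from
the text.

```python
def f_for_correlation_ratio(eta_sq, n_cases, k_arrays):
    """F for testing the correlation ratio from eta squared, N, and k."""
    between = eta_sq / (k_arrays - 1)                 # eta^2 / (k - 1)
    within = (1.0 - eta_sq) / (n_cases - k_arrays)    # (1 - eta^2) / (N - k)
    return between / within


# Hypothetical values, chosen only for illustration.
f_value = f_for_correlation_ratio(0.25, 50, 5)
print(round(f_value, 2))  # 3.75, referred to the F table with df of 4 and 45
```

Holding the df's fixed and raising the squared correlation ratio
raises F, which is the point made above.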


SIGNIFICANCE OF LINEAR CORRELATION 

An appreciable correlation between two variables which are 
linearly related implies that the slopes of the regression lines are 
not zero, which in turn implies that the variance of predicted 
values is large enough to have some kind of statistical significance. 
The variance technique may be used as a test of the significance 
of linear regression. 

Suppose that we develop the argument in terms of the regression
of Y on X. We may write the linear equation for predicting
Y from X as $Y' = BX + A$. If we think of this regression line
as having been drawn on the scatter diagram, it can readily be
seen that the deviation of any person's Y value from the mean
of the Y's can be expressed in terms of its deviation from the
regression line (or predicted value) plus the deviation of the
predicted value from the mean of the Y's:

$$(Y - \bar{Y}) = (Y - Y') + (Y' - \bar{Y})$$

in which $Y'$ will vary from person to person in accordance with
his X score. If we square all such $(Y - \bar{Y})$ deviations and sum
over all cases, we get

$$\sum\sum (Y - \bar{Y})^2 = \sum [(Y - Y') + (Y' - \bar{Y})]^2 = \sum (Y - Y')^2 + \sum (Y' - \bar{Y})^2 + 2 \sum (Y - Y')(Y' - \bar{Y})$$




for which double summation signs are not needed for clarity even 
though the summing is over all cases. The last or cross-product 
term has to do with a possible relationship between predicted 
values and residuals, but, as was shown in Chapter 7, this correlation
is always zero, and hence this last term vanishes.
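
The breakdown can be checked numerically. The sketch below fits a
least-squares line to a small set of made-up scores (the data and the
function name are assumptions for illustration only) and returns the
total, residual, and regression sums of squares together with the
cross-product term.

```python
def regression_partition(xs, ys):
    """Return total, residual, regression sums of squares and the cross-product."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope B and intercept A for Y' = BX + A.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    regression = sum((p - mean_y) ** 2 for p in predicted)
    cross = sum((y - p) * (p - mean_y) for y, p in zip(ys, predicted))
    return total, residual, regression, cross


# Made-up scores for six cases.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
total, residual, regression, cross = regression_partition(xs, ys)
print(total, residual + regression, cross)
```

Apart from floating-point rounding, the first two printed values
agree and the third is zero.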

Therefore the sum of squares can be broken down into two 
components: residuals or within arrays about the regression line 
and a part depending on the variation of the predicted values 
about the mean. If the correlation between X and Y were zero, 
this latter component would be zero because one would predict
$\bar{Y}$ for all cases. The departure of this sum of squares or of a variance
estimate based thereon from zero might lead one to conclude
that real correlation exists in the population being sampled if it 
were not for the fact that sampling errors ordinarily operate so 
as to prevent the obtaining of zero correlation. 

Before attempting to understand the operation of chance sampling,
we should consider the degrees of freedom associated with
the sums of squares. As usual, the total sum of squares is based
on $N - 1$ degrees of freedom. The df for $\sum (Y - Y')^2$ may not
be immediately obvious, but note that, if N = 2 and variation
exists for both X and Y, the regression line would necessarily
pass through the two points defined by the pair of scores, r would
be unity, and $\sum (Y - Y')^2$ would be zero. In other words, with
N = 2, there is no freedom for deviation from the regression line.
From this it would be inferred that N needs to be reduced by 2,
or that df = $N - 2$, a deduction which is consistent with the fact
that, in fitting a straight line, two constants are determined from
the data, and hence two restrictions are imposed on the N deviations
of the type $(Y - Y')$.

Since the df's for the component sums of squares are additive
to that for the total, one can determine the df for the regression or
$\sum (Y' - \bar{Y})^2$ term by subtracting the df for residuals from that
for the total: $(N - 1) - (N - 2) = 1$ as the df for the regression
term. But determination of a df by subtraction does not permit
the additive check on the correctness of the df's which is possible
in case each df is ascertained separately on the basis of some
principle. By what principle could one determine that for the
regression sum of squares the proper df is 1? The value of
$\sum (Y' - \bar{Y})^2$ will not be changed by shifting from gross scores to
deviation scores, i.e., by moving the origin to the intersection of
$\bar{X}$ and $\bar{Y}$. It will be recalled that the regression equation in
deviation units is $y' = bx$ (where b = B of the gross score form), and
accordingly we may write

$$\sum (Y' - \bar{Y})^2 = \sum (y' - \bar{y})^2 = \sum (y' - 0)^2 = \sum (bx)^2 = b^2 \sum x^2$$

which permits us to examine the source or sources of variation in
the regression sum of squares. Its value depends upon $b$ and
$\sum x^2$, but the value of $\sum x^2$ does not depend upon the degree of
correlation. For a fixed set of X's, the freedom of $\sum (Y' - \bar{Y})^2$
to vary springs from b, i.e., from one value; therefore the df is 1.
A slightly different way of considering the question is to note
that, since $b = r(\sigma_y/\sigma_x)$ and $\sum x^2 = N\sigma_x^2$, the sum of the squares
of the predicted values can be written as

$$\sum (Y' - \bar{Y})^2 = r^2 \frac{\sigma_y^2}{\sigma_x^2} N \sigma_x^2 = N r^2 \sigma_y^2$$

from which it can be argued that, since the variation in predicted
values is a function neither of N nor of the variance of the trait
being predicted, it is a function of one value, the degree of
correlation.

Now let us return to a brief consideration of sampling or of the
meaning of the variance estimates which result from dividing the
sums of squares by their df's. On the basis of the null hypothesis,
that the degree of linear correlation is zero for the population
being sampled, the regression line for the population would pass
through $\bar{Y}$, with zero slope or parallel to the x axis. Hence
$(Y - Y')$ will equal $(Y - \bar{Y})$, and the variance of the residuals
will equal the total variance of the Y's. A sample from the population
will seldom yield zero correlation (or zero regression), and
therefore the residuals will tend to be somewhat reduced, or
$\sum (Y - Y')^2$ will tend to be less than $\sum\sum (Y - \bar{Y})^2$. It can be
shown that $\sum (Y - Y')^2/(N - 2)$ gives an unbiased estimate of
the population variance when no correlation exists in the
population.

That the estimate based on the regression sum of squares,
$\sum (Y' - \bar{Y})^2$, divided by df = 1, is also an unbiased estimate of
the same population variance may not seem plausible, nor is it
easily explained in an elementary treatment. For any sample,
$\sum (Y' - \bar{Y})^2$ equals the difference between $\sum\sum (Y - \bar{Y})^2$
and $\sum (Y - Y')^2$, and it can be demonstrated that on the
average the value of $\sum\sum (Y - \bar{Y})^2 - \sum (Y - Y')^2$ will equal
$\sum\sum (Y - \bar{Y})^2/(N - 1)$, or that the mean value of $\sum (Y' - \bar{Y})^2$
for successive samples will be $\sum\sum (Y - \bar{Y})^2/(N - 1)$. Since the
latter is an unbiased estimate of the population variance, it follows
that $\sum (Y' - \bar{Y})^2/1$ must be an estimate of the same variance.

Of the three variance estimates, only the estimates based on
residuals and on regression are independent. The sampling distribution
of their ratio is that of F. Let $s_r^2$ stand for the estimate
based on the residual sum of squares and $s_p^2$ stand for the
estimate based on predictions by a linear regression function.
Then, if $s_p^2/s_r^2$, with $n_1 = 1$ and $n_2 = N - 2$, falls at or beyond
the .01 level of significance, the null hypothesis becomes suspect.
This means that the $s_p^2$ estimate is larger than expected on the
basis of sampling, from which it may be inferred that regression
is a real source of variation in $\sum (Y' - \bar{Y})^2$, i.e., that the slope of
the regression for the population is not zero, or that some correlation
exists.

We have already noted that

$$\sum (Y' - \bar{Y})^2 = N r^2 \sigma_y^2$$

Since $\sum (Y - Y')^2$ divided by N equals the error of estimate variance,
previously proved to equal $\sigma_y^2 (1 - r^2)$, it follows readily
that

$$\sum (Y - Y')^2 = N (1 - r^2) \sigma_y^2$$

Accordingly

$$s_p^2 = \frac{\sum (Y' - \bar{Y})^2}{1} = \frac{N r^2 \sigma_y^2}{1}$$

$$s_r^2 = \frac{\sum (Y - Y')^2}{N - 2} = \frac{N (1 - r^2) \sigma_y^2}{N - 2}$$

Therefore

$$F = \frac{N r^2 \sigma_y^2 / 1}{N (1 - r^2) \sigma_y^2 / (N - 2)} = \frac{r^2}{(1 - r^2)/(N - 2)}$$

which is the square of the t given earlier (p. 226) for testing the
significance of r. Thus, again we have $F = t^2$ when $n_1 = 1$.
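
The identity between F and the square of t can be verified with a
short sketch. The t used here is the familiar ratio of
$r\sqrt{N - 2}$ to $\sqrt{1 - r^2}$; the function names and the
values r = .50 and N = 27 are assumptions made only for this example.

```python
import math


def f_for_r(r, n_cases):
    """F = r^2 / [(1 - r^2) / (N - 2)], with df of 1 and N - 2."""
    return r ** 2 / ((1.0 - r ** 2) / (n_cases - 2))


def t_for_r(r, n_cases):
    """t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 df."""
    return r * math.sqrt(n_cases - 2) / math.sqrt(1.0 - r ** 2)


# Hypothetical values, chosen only for illustration.
r, n_cases = 0.5, 27
print(round(f_for_r(r, n_cases), 3), round(t_for_r(r, n_cases) ** 2, 3))
```

Both printed figures are the same (about 8.333 for these values), so
either test leads to the same judgment.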

The reader will have noted that, since the required sums of 
squares and the resulting F can readily be expressed in terms of 





r, there is no need to worry further about a computational scheme 
for securing the sums of squares. The easier thing to do is simply 
to compute r. After that is done, either the F or the t test may 
be used for judging whether the correlation is significant. This 
discussion of the linear correlation problem here should help the 
student appreciate the generality of the analysis of variance 
technique and should also provide him with relevant concepts 
for understanding the test for curvilinearity of regression, to 
which we now turn. 


TESTING LINEARITY OF REGRESSION 

We have seen that the correlation ratio is a general measure of
the degree of correlation and that r measures the degree of linear
relationship. Even though the regression of Y on X for a population
be exactly linear, it will be found for a sample that the means
of the arrays will show some deviation from a straight line; hence,
as previously pointed out, the correlation ratio will tend to be
larger than r. How large should the difference between $\eta$ and r be
before one suspects nonlinearity, or how much can the array means
deviate from a straight line by chance? Before the development
of the analysis of variance technique, the inadequate Blakeman
criterion was used to answer the foregoing. In presenting the
currently accepted method, we shall carry the argument through
on the basis of the regression of Y on X.

Imagine a scatter diagram with regression line drawn and the
array mean located in each vertical array. For a score in the rth
array, the deviation of Y from $\bar{Y}$ can be thought of in terms of its
deviation from the array mean, $\bar{Y}_r$, plus the deviation of the array
mean from the predicted value, $Y'_r$, plus the deviation of the
predicted value from the total mean. In symbols,

$$(Y - \bar{Y}) = (Y - \bar{Y}_r) + (\bar{Y}_r - Y'_r) + (Y'_r - \bar{Y})$$

Squaring and summing for the $m_r$ cases in each array and then
summing over all k arrays (equivalent to summing over all groups),
we have

$$\sum\sum (Y - \bar{Y})^2 = \sum\sum (Y - \bar{Y}_r)^2 + \sum_r m_r (\bar{Y}_r - Y'_r)^2 + \sum_r m_r (Y'_r - \bar{Y})^2$$

the cross-product terms having vanished because the component
parts are uncorrelated.

The first component is a sum of squares based on within-array
variation with $N - k$ degrees of freedom. We encountered this
in checking the significance of the correlation ratio, and we then
labeled as $s_w^2$ the variance estimate based thereon.

The second sum involves deviations of array means from linear
regression. Its df will be $k - 2$ since there are k means and two
restrictive constants in $Y'_r$. If k = 2, the two means cannot
vary from the fitted line. Let us use $s_d^2$ as a symbol for the variance
estimate based on this sum of squares.

The third sum, which has to do with the part of the total variance
predictable by means of linear regression, is very similar to
that occurring a few pages earlier in connection with the F test
of the correlation coefficient. It differs only in that the same
value is predicted for all cases within an array regardless of their
location in the X interval defining the array. This is equivalent
to a linear prediction of the mean of the array. Actually, the
numerical value of $\sum (Y' - \bar{Y})^2$ as calculated by $N r^2 \sigma_y^2$, which
equals $r^2 \sum\sum (Y - \bar{Y})^2$, will be the same as $\sum_r m_r (Y'_r - \bar{Y})^2$
computed directly, provided r was originally determined from a scatter
diagram with the same intervals now being used to define the
arrays. We have already seen that the df for this sum is 1, and
we have used $s_p^2$ as a symbol for the estimate based thereon.

It will be recalled that, in the scheme for testing the significance
of the correlation ratio, the total sum of squares was broken down
into a within-array and a between-array part. We now have a
breakdown into within array (as before) plus two additional
parts: the sum $\sum_r m_r (\bar{Y}_r - \bar{Y})^2$ is broken into

$$\sum_r m_r (\bar{Y}_r - Y'_r)^2 + \sum_r m_r (Y'_r - \bar{Y})^2$$

It will also be recalled that

$$\sum_r m_r (\bar{Y}_r - \bar{Y})^2 = \eta^2 \sum\sum (Y - \bar{Y})^2$$

and that

$$\sum_r m_r (Y'_r - \bar{Y})^2 = r^2 \sum\sum (Y - \bar{Y})^2$$

By subtraction, we see that the new sum, $\sum_r m_r (\bar{Y}_r - Y'_r)^2$, is
equivalent to $(\eta^2 - r^2) \sum\sum (Y - \bar{Y})^2$.

For convenience, we shall now assemble in an analysis of variance
table the several symbolic expressions having to do with
testing the significance of (1) the correlation ratio, (2) the linear
regression coefficient, and (3) nonlinearity of regression. Table 41
gives the sources of variation, the sums of squares and their
equivalents in terms of r or $\eta$, the degrees of freedom, and a symbol
for each of the variance estimates.

Table 41. Analysis of Variance Functions for Bivariate Correlation

Source of Variation                 Sum of Squares = Equivalent                                                    df       Estimate
(a) Linear regression               $\sum_r m_r (Y'_r - \bar{Y})^2 = r^2 \sum\sum (Y - \bar{Y})^2$                 1        $s_p^2$
(b) Deviation of means from line    $\sum_r m_r (\bar{Y}_r - Y'_r)^2 = (\eta^2 - r^2) \sum\sum (Y - \bar{Y})^2$    k - 2    $s_d^2$
(c) Between-array means             $\sum_r m_r (\bar{Y}_r - \bar{Y})^2 = \eta^2 \sum\sum (Y - \bar{Y})^2$         k - 1    $s_b^2$
(d) Within arrays                   $\sum\sum (Y - \bar{Y}_r)^2 = (1 - \eta^2) \sum\sum (Y - \bar{Y})^2$           N - k    $s_w^2$
(e) Residual from line              $\sum\sum (Y - Y'_r)^2 = (1 - r^2) \sum\sum (Y - \bar{Y})^2$                   N - 2    $s_r^2$
(f) Total                           $\sum\sum (Y - \bar{Y})^2$                                                     N - 1

Note, in review, that for
the sums of squares, their equivalents, and the df's, the following
additions hold true:

(a) + (b) = (c)

(a) + (e) = (f)

(c) + (d) = (f)

(a) + (b) + (d) = (f)

The several useful and permissible F's, or ratios of independent
and unbiased variance estimates, along with the proper df's ($n_1$
and $n_2$ values) for entering the table of F, may be stated in summary
form:

$F_1 = s_b^2/s_w^2$;  $n_1 = k - 1$,  $n_2 = N - k$:  significance of correlation ratio
$F_2 = s_p^2/s_r^2$;  $n_1 = 1$,      $n_2 = N - 2$:  significance of linear correlation
$F_3 = s_d^2/s_w^2$;  $n_1 = k - 2$,  $n_2 = N - k$:  significance of curvilinearity

We have already discussed the first two of these F's. If we
write the third in terms of sums and df's, we have

$$F_3 = \frac{\sum_r m_r (\bar{Y}_r - Y'_r)^2 / (k - 2)}{\sum\sum (Y - \bar{Y}_r)^2 / (N - k)} = \frac{(\eta^2 - r^2) \sum\sum (Y - \bar{Y})^2 / (k - 2)}{(1 - \eta^2) \sum\sum (Y - \bar{Y})^2 / (N - k)} = \frac{(\eta^2 - r^2)/(k - 2)}{(1 - \eta^2)/(N - k)}$$

which indicates definitely that its value, for given df's, is a reflection
of the difference between the correlation ratio and the correlation
coefficient. Therefore, in testing the significance of the
variation of array means from linear regression, we are testing
the significance of the difference between $\eta$ and r. If $F_3$ falls
beyond the .01 probability level, the hypothesis of linear regression
for the population being sampled is rejected. When this
happens, it follows that the correlation coefficient and a linear
regression function for Y on X are not appropriate measures to
use in describing the relationship.

If one is also interested in testing the significance of the correlation
ratio for X on Y and the linearity of the horizontal array
means, the analysis is carried through with X's substituted for
Y's. Since the number of grouping intervals on the two axes
need not be the same, the value of k may differ for the two analyses.
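
As a minimal sketch of the linearity test expressed in terms of the
two coefficients, the function below computes the third F from the
squared correlation ratio, the squared r, N, and k. The function
name and the values used are assumptions for illustration and are not
drawn from the text.

```python
def f_for_linearity(eta_sq, r_sq, n_cases, k_arrays):
    """F for departure of array means from a straight line.

    Numerator df is k - 2; denominator df is N - k.
    """
    numerator = (eta_sq - r_sq) / (k_arrays - 2)
    denominator = (1.0 - eta_sq) / (n_cases - k_arrays)
    return numerator / denominator


# Hypothetical values, chosen only for illustration.
f_3 = f_for_linearity(0.50, 0.40, 60, 10)
print(round(f_3, 2))  # 1.25, referred to the F table with df of 8 and 50
```

When the correlation ratio and r are equal the numerator is zero, so
F is zero and no departure from linearity is indicated.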





ILLUSTRATIVE PROBLEM: r, $\eta$, AND CURVILINEARITY

The foregoing three tests of significance and the computations
necessary thereto may be illustrated by the data of Table 42,
which gives the bivariate distribution for the relationship between
initial (sum of scores on trials 1-4) and final (trials 67-70)
performance on the Koerth pursuit rotor. Since it is logical to be
concerned with the prediction of final from initial score, or the
regression of Y on X, we shall be dealing with variations on the
Y variable.

Table 42. Bivariate Scatter for Initial and Final Scores of 92 Boys
on Koerth Pursuit Rotor

                                      X = Initial Score
Y = Final
Score           Code       0      30      60      90     120     150     180     210      f_y
740               11                               1                                        1
700               10               1       2       1       1               2       2        9
660                9       1       1       1       4       3               1       2       13
620                8       2       8       2       2       2               1               17
580                7       3       3       7       1       1                       1       16
540                6       2       8       5                                               15
500                5       2       5       3       1                                       11
460                4       3       1                                                        4
420                3       2                                                                2
380                2                                                                        0
340                1       3                                                                3
300                0       1                                                                1
$f_x = m_r$               19      27      20      10       7       0       4       5       92 = N
$\sum Y$                  89     181     139      85      60       0      37      45      636
$\sum Y^2$               547    1269    1007     747     520       0     345     411     4846
$(\sum Y)^2/m_r$      416.89 1213.37  966.05  722.50  514.29       0  342.25  405.00  4580.35

In the first place, the correlation coefficient is computed from 
the scatter diagram by the method given in Chapter 6. Its value 
of .5687 is about .01 lower than the coefficient computed from a 
scatter with twice as many intervals. The use of so few intervals 
for the X variable would obviously not be recommended for the 







computation of r, but in this illustration it is convenient because 
of page-space limitations. There is the additional consideration 
that for computing the correlation ratio one should avoid having 
too few cases per array, which if the sample is small may mean 
only a few intervals on the independent variable. At least 12 
intervals should be used for the dependent variable. In checking 
on linearity, it is necessary that we calculate r from a scatter 
with the same grouping intervals used in computing $\eta$, and no
corrections for grouping error are needed. 

For the computation of the correlation ratio and for the testing
of its significance, we need the within arrays, the between arrays,
and the total sum of squares. These may be computed from
coded scores (deviations from an arbitrary origin in terms of step
intervals), and the entire analysis may be carried through on the
basis of coded scores, so that cumbersomely large figures are
avoided. The reader who wishes to follow the computational
procedure will need to note the following features of Table 42.
The marginal frequencies on the right are for all the Y scores,
and the $f_x$'s along the bottom margin are the $m_r$'s, or cases per
array. For each vertical array and for the right-hand margin,
$\sum Y$ and $\sum Y^2$ are computed in terms of coded values (these
correspond to $\sum d$ and $\sum d^2$ of Chapter 3). Summing across the $\sum Y$
and $\sum Y^2$ rows should yield the $\sum Y$ and $\sum Y^2$ obtained from the
marginal distribution. For this problem, $\sum\sum Y = 636$ and
$\sum\sum Y^2 = 4846$. The last row, containing the several values of
$(\sum Y)^2/m_r$, is summed across for the needed $\sum_r (\sum Y)^2/m_r$,
which is 4580.35 in this example. There is no check on this figure
by calculations based on the margin.

In order to get the sums of squares of deviations, the values 636,
4846, and 4580.35 are substituted in formulas (98) with X replaced
by Y.

$$\sum\sum (Y - \bar{Y})^2 = 4846 - \frac{636^2}{92} = 449.30$$

$$\sum\sum (Y - \bar{Y}_r)^2 = 4846 - 4580.35 = 265.65$$

$$\sum_r m_r (\bar{Y}_r - \bar{Y})^2 = 4580.35 - \frac{636^2}{92} = 183.65$$

By formula (100) we now obtain

$$\eta^2 = \frac{183.65}{449.30} = .40874; \qquad \eta = .639$$

which is the correlation ratio for Y on X.

The other sums of squares called for in schematic Table 41 may
be calculated from their equivalents in terms of $r^2$ and/or $\eta^2$.
Note that $r^2 = .5687^2 = .32342$.

$$\sum_r m_r (Y'_r - \bar{Y})^2 = (.32342)(449.30) = 145.31$$

$$\sum\sum (Y - Y'_r)^2 = (1 - .32342)(449.30) = 303.99$$

$$\sum_r m_r (\bar{Y}_r - Y'_r)^2 = (.40874 - .32342)(449.30) = 38.34$$

The several sums of squares and their respective degrees of freedom
are set forth in Table 43, which contains also the variance
estimates obtained by dividing the sums of squares by their df's.

Table 43. Analysis of Variance Table for Regression of Final (Y)
on Initial Score for Data of Table 42

Source                          Sum of Squares    df    Variance Estimate
Linear regression                       145.31     1    145.31 = $s_p^2$
Deviation of means from line             38.34     5      7.67 = $s_d^2$
Between-array means                     183.65     6     30.61 = $s_b^2$
Within arrays                           265.65    85      3.13 = $s_w^2$
Residual from line                      303.99    90      3.38 = $s_r^2$
Total                                   449.30    91

From these variance estimates, we have the following:

For testing the significance of the correlation ratio we have
$F_1 = 30.61/3.13 = 9.8$, which for $n_1 = 6$ and $n_2 = 85$ is highly
significant. The .001 level of significance requires an F of about
4.0.

For testing the significance of linear correlation, i.e., r, we have
$F_2 = 145.31/3.38 = 43.0$, which for $n_1 = 1$ and $n_2 = 90$ is likewise
highly significant, the .001 level being at an F of about 11.6.

For testing linearity of regression, i.e., the departure of the
array means from a straight line, we have $F_3 = 7.67/3.13 = 2.5$,
which for $n_1 = 5$ and $n_2 = 85$ is near the .05 level of significance.
Thus the apparent departure from linearity in Table 42 is not
sufficiently great to lead to rejection of the hypothesis of linearity;
one would, however, question the hypothesis. This is an example
of borderline significance which calls for drawing another sample
or adding more cases before one sets forth a conclusion. For the
problem at hand, a second sample of 90 boys yields a scatter
diagram much like that of Table 42, so we would reject the hypothesis
of linearity of regression.
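
The arithmetic of this problem can be retraced with a short Python
sketch. The input figures (N = 92, seven arrays containing cases,
the coded sums 636, 4846, and 4580.35, and r = .5687) are those of
the illustration; the variable names are assumptions of this sketch.
Because unrounded values are carried throughout, the printed results
may differ in the last digit from hand computation with rounded
figures.

```python
n_cases, k_arrays = 92, 7            # 92 boys; 7 arrays contain cases
sum_y, sum_y_sq = 636.0, 4846.0      # coded sums from the margin of Table 42
sum_array_term = 4580.35             # sum over arrays of (sum Y)^2 / m_r
r = 0.5687                           # correlation from the same scatter

correction = sum_y ** 2 / n_cases
ss_total = sum_y_sq - correction
ss_within = sum_y_sq - sum_array_term
ss_between = sum_array_term - correction
eta_sq = ss_between / ss_total

ss_regression = r ** 2 * ss_total            # linear regression
ss_residual = ss_total - ss_regression       # residual from line
ss_deviation = ss_between - ss_regression    # deviation of means from line

# Variance estimates: each sum of squares divided by its df.
s2_b = ss_between / (k_arrays - 1)
s2_w = ss_within / (n_cases - k_arrays)
s2_p = ss_regression / 1
s2_r = ss_residual / (n_cases - 2)
s2_d = ss_deviation / (k_arrays - 2)

f1 = s2_b / s2_w    # significance of the correlation ratio
f2 = s2_p / s2_r    # significance of linear correlation
f3 = s2_d / s2_w    # departure from linearity

print(round(eta_sq, 4), round(f1, 2), round(f2, 2), round(f3, 2))
```

The three F's are then referred to the table of F with df's of
6 and 85, 1 and 90, and 5 and 85 respectively.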

The student should keep in mind that the test for linearity can 
lead to the definite conclusion that the regression is curvilinear 
(if F is large enough), whereas a low F does not prove linearity. 
Why? 

If the hypothesis of linearity is disproved, it follows that the 
correlation coefficient is not a suitable figure for describing the 
relationship. The correlation ratio can be used to describe the 
degree of association, but the form of the relationship should be 
described by a fitted curve or by a verbal description of the general
curve tendency of the array means. Some readers will have
noted that the correlation ratio cannot be considered very descriptive
of the data of Table 42 because of heteroscedasticity. As a
matter of fact, the lack of homoscedasticity may also mean that 
our analysis of variance test for linearity is subject to question in 
that the assumption of homogeneity of variance is violated. The 
possible extent and direction of the error due to this failure of 
the groups, as defined by intervals on the x axis, to exhibit like 
variances cannot be specified, but it is doubtful whether the error 
is serious. 


APPLICATION TO MULTIPLE CORRELATION 

The reader may recall that the methods given in Chapter 9 for 
judging the significance of the multiple correlation coefficient 
involved unsatisfactory approximations. In so far as we are 
interested in testing the deviation of a multiple r from zero, the 
analysis of variance technique provides an exact test which is 
applicable when the sample is either small or large. 

Let us suppose that Y is a dependent variable which is to be
predicted by a multiple regression equation containing m independent
variables designated by $X_1, X_2, \cdots, X_m$. The prediction
equation may be written as

$$Y' = A + B_1 X_1 + B_2 X_2 + \cdots + B_m X_m$$

in which the B's are the regression coefficients. The deviation of
any individual's Y score from the mean Y can be expressed as
the sum of two parts: the deviation of his Y from his predicted
value plus the deviation of the predicted value from the mean
of the Y's, thus,

$$(Y - \bar{Y}) = (Y - Y') + (Y' - \bar{Y})$$

If we square both sides and sum over all cases, we have

$$\sum\sum (Y - \bar{Y})^2 = \sum (Y - Y')^2 + \sum (Y' - \bar{Y})^2$$

which is exactly analogous to the breakdown used in connection
with the test of the linear correlation coefficient. One part has
to do with residuals about the regression plane, the other with
variations in the predicted values. The cross-product term again
vanishes; it can be shown that there is no correlation between
residuals and predicted values.

As previously, we label the $\sum (Y - Y')^2$ as the residual sum of
squares and $\sum (Y' - \bar{Y})^2$ as the regression sum of squares. The
total sum of squares will, of course, have $N - 1$ degrees of freedom.
The residual sum of squares will lose df's according to the
number of constants in the regression equation. We have the
constant A, and the number of B constants is m; hence df =
$N - (m + 1) = N - m - 1$ for the residual term. The reader
who does not immediately see the reasonableness of this should
consider the case of 1 dependent and 2 independent variables
with varying scores on N = 3 cases. Imagine that the 3 scores
for each case can be used to locate a point for each in three-dimensional
space, and then think of fitting an ordinary plane
to these 3 points. Obviously, the plane can be made to pass
through all 3; hence the prediction would be perfect, and there
would be no freedom for any of the 3 points to vary from the
plane. That is, with N = 3 (and with variation on all 3 variables),
the multiple r derived therefrom must be unity.

Now, as to the df for the regression or prediction sum of squares,
we note that for a fixed set of values for the X's the variation of
this term must depend upon the slopes of the regression plane or
upon the B's. There being m B's, there are m ways in which
this sum can vary; therefore df = m. This is, it will be noted,
an extension of the argument used to explain why df = 1 for
testing the linear correlation coefficient. If our df determinations
are correct, we should have $(N - m - 1) + m$ adding to $N - 1$,
which is seen to be the case.

In Chapter 9 it was pointed out that the multiple correlation
coefficient can be defined as

$$r_{1.23 \cdots m}^2 = 1 - \frac{\sigma_{1.23 \cdots m}^2}{\sigma_1^2}$$

in which $\sigma_{1.23 \cdots m}^2$ represents the residual variance and $\sigma_1^2$ is the
variance for the dependent variable. Since the residual variance
plus the predicted variance adds to the total, the multiple r can
also be expressed as the ratio of the predicted to the total variance.
(Note that we are here speaking of variances, not estimates.)
By definition, the residual variance is $\sum (Y - Y')^2/N$, the predicted
variance is $\sum (Y' - \bar{Y})^2/N$, and the total variance is
$\sum\sum (Y - \bar{Y})^2/N$. We may therefore write the multiple correlation
coefficient, using R in order to avoid subscripts, as

$$R^2 = 1 - \frac{\sum (Y - Y')^2 / N}{\sum\sum (Y - \bar{Y})^2 / N}$$

from which it is readily seen that

$$\sum (Y - Y')^2 = (1 - R^2) \sum\sum (Y - \bar{Y})^2$$

From the alternate way of regarding multiple correlation, we have

$$R^2 = \frac{\sum (Y' - \bar{Y})^2 / N}{\sum\sum (Y - \bar{Y})^2 / N}$$

which leads to $\sum (Y' - \bar{Y})^2 = R^2 \sum\sum (Y - \bar{Y})^2$.

Thus the sums of squares have their equivalents in terms of
R, and consequently they may be computed by way of R. The
computation of these sums directly would be a hammer-and-tongs
approach which would involve the laborious task of predicting
by means of the regression equation the Y for each individual.

The foregoing may be assembled in a schematic variance table,
like Table 44. As in testing the significance of the ordinary correlation
coefficient, we set the null hypothesis to the effect that the
estimate based on the regression sum of squares will differ from
that based on the residual sum only because of chance sampling
errors. The null hypothesis implies that, if the entire population
were measured, the correlation of the dependent variable with
each independent variable would be zero. Now, when a sample
is drawn from such a population, the r's will vary more or less
from zero with the result that the multiple R will likewise differ
from zero. If the conditions of the null hypothesis hold true,
the sampling distribution of $s_p^2/s_r^2$ follows that of the F distribution
with appropriate degrees of freedom.

Table 44. Variance Setup for Testing Significance of Multiple
Correlation Coefficient

Source        Sum of Squares = Equivalent                                 df           Estimate
Regression    $\sum (Y' - \bar{Y})^2 = R^2 \sum\sum (Y - \bar{Y})^2$      m            $s_p^2$
Residual      $\sum (Y - Y')^2 = (1 - R^2) \sum\sum (Y - \bar{Y})^2$      N - m - 1    $s_r^2$
Total         $\sum\sum (Y - \bar{Y})^2$                                  N - 1

Note that

$$F = \frac{s_p^2}{s_r^2} = \frac{\sum (Y' - \bar{Y})^2 / m}{\sum (Y - Y')^2 / (N - m - 1)} = \frac{R^2 \sum\sum (Y - \bar{Y})^2 / m}{(1 - R^2) \sum\sum (Y - \bar{Y})^2 / (N - m - 1)} = \frac{R^2 / m}{(1 - R^2)/(N - m - 1)}$$

hence F is a ratio which depends upon R and the df's. If the
numerator is less than the denominator, we may conclude without
reference to the table of F that R is insignificant. When the
numerator is the larger, one judges the significance of F by entering
the table of F with $n_1 = m$ and $n_2 = N - m - 1$. Once R
has been computed, the calculations involved in checking its
significance are so simple that an example would be humdrum.
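
For the reader who nevertheless wants the calculation spelled out,
a minimal sketch follows. The function name and the values (a
squared multiple R of .36, N = 54, m = 3) are assumptions made for
illustration only.

```python
def f_for_multiple_r(r_sq, n_cases, m_predictors):
    """F = (R^2 / m) / [(1 - R^2) / (N - m - 1)]."""
    numerator = r_sq / m_predictors
    denominator = (1.0 - r_sq) / (n_cases - m_predictors - 1)
    return numerator / denominator


# Hypothetical values, chosen only for illustration.
f_value = f_for_multiple_r(0.36, 54, 3)
print(round(f_value, 3))  # 9.375, referred to the F table with df of 3 and 50
```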

In the chapter on multiple correlation, it was pointed out that
R as computed tends to have a positive bias, the extent of which
could be judged by formula (75). This formula can readily be
derived by the use of estimated residual and trait variances in
place of actual variances in formula (70). Best or unbiased estimates
lead to an unbiased R, or provide an unbiased estimate of
the population value of R. Formula (75) gives this improved
estimate, but the improvement is negligible except when N is
small, or when m is large relative to N. It should be stressed that
neither the analysis of variance check on the significance of R
nor the improved estimate of R allows for the fallacy involved in
multiple correlation work when from among a large number of
variables a few are chosen for inclusion in the analysis because
they show correlation with the criterion. Such selection tends
to capitalize on r's which are among the highest partly because of
chance errors.

A practical question of considerable importance arises when
one wonders whether the inclusion of additional variables in the
multiple regression equation leads to a significant increase in the
accuracy of prediction or when one wishes to know whether the
dropping of certain variables results in a significant decrease in
the amount of variance predicted. The inclusion of additional
variables in the equation always tends to reduce the error of estimate
somewhat and leads to an increase in R. Can it be said
that the increase in R possesses statistical significance?

Let $R_1$ be the multiple based on $m_1$ independent variables and
$R_2$ be the value based on $m_2$ variables selected from among the $m_1$
variables. To test the significance of the difference between $R_1$
and $R_2$, we take

$$F = \frac{(R_1^2 - R_2^2)/(m_1 - m_2)}{(1 - R_1^2)/(N - m_1 - 1)}$$

with $n_1 = m_1 - m_2$ and $n_2 = N - m_1 - 1$. If F falls beyond
the .01 point, we can safely assume that the apparent gain in
using the additional variable or variables possesses statistical
significance.
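
A minimal sketch of this test is given below. The function name and
the values used (squared multiples of .50 and .40 based on 5 and 3
variables, with N = 56) are assumptions chosen only to show the
arithmetic.

```python
def f_for_r_increase(r1_sq, r2_sq, n_cases, m1, m2):
    """F for the gain from m2 to m1 predictors (the m2 set lies within the m1 set)."""
    numerator = (r1_sq - r2_sq) / (m1 - m2)
    denominator = (1.0 - r1_sq) / (n_cases - m1 - 1)
    return numerator / denominator


# Hypothetical values, chosen only for illustration.
f_value = f_for_r_increase(0.50, 0.40, 56, 5, 3)
print(round(f_value, 2))  # 5.0, referred to the F table with df of 2 and 50
```

Note that the residual term in the denominator comes from the larger
set of variables.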



CHAPTER 14 


Analysis of Variance: Complex 


In the previous chapter an explanation of the fundamental idea 
of the analysis of variance technique was attempted, and applications
to relatively simple situations were given. In general, these
situations involved the testing of the significance of the over-all 
variation of the means for several groups, the groups differing on 
the basis of a single classificatory principle. Such setups are 
sometimes referred to as single variable experiments, by which 
is meant that groups differing in one known respect are compared 
on a dependent variable. For example, income might be considered
a variable which is dependent in part on amount of education,
which accordingly becomes the independent, single variable
for classifying individuals into groups. Or it might be that the 
classificatory variable is subject to experimental manipulation, 
and we wish to determine whether variations thereof will lead to 
performance or response differences. The Wright experiment 
cited in Chapter 13 is an example of this. 

There are times when it is not only feasible but advisable to 
design the experimental setup so as to make one set of data serve 
for the testing of hypotheses regarding the separate influence of 
two or more independent variables. This type of thing has been 
done for a long time in psychological research wherein it has been 
possible to classify a total group first one way, then another, and 
perhaps a third way. For example, in order to determine some 
of the possible correlates of measured intelligence, we may classify 
a group of children into urban, suburban, and rural groups; then, 
ignoring this basis for grouping, we may classify them as to occupational 
level of father; or the classification may be by sex or by 
grade location or by age. Such a procedure in which one variable 
is considered at a time is tantamount to the single variable setup, 
even though the same batch of data is made to answer questions 
about the effects of different independent variables. 

Now it is obvious that, in studying factors associated with 
intelligence, we could make a double classification by classifying 
our cases simultaneously on two of the variables, or a triple classification 
by using three variables, etc. Consider for the moment 
a double classification based on the three rural-urban categories 
and on sex. This would lead to the assigning of the cases to six 
groups, each of which would have a mean IQ. Instead of having 
three means for groupings on the basis of the rural-urban characteristic, 
we would now have two sets of such means, one set for 
each sex. Instead of two means for the total group classified by 
sex, we would have three sets of sex means, a set for each of the 
three residence categories. 
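
A minimal sketch of such a double classification follows. The records, the group labels, and the helper `group_means` are all hypothetical and serve only to show how the same cases yield three residence means, two sex means, and six cell means.

```python
from collections import defaultdict

# Hypothetical (residence, sex, IQ) records; the values are invented.
cases = [
    ("urban", "boy", 104), ("urban", "boy", 100),
    ("urban", "girl", 106), ("urban", "girl", 102),
    ("suburban", "boy", 101), ("suburban", "boy", 105),
    ("suburban", "girl", 103), ("suburban", "girl", 99),
    ("rural", "boy", 96), ("rural", "boy", 94),
    ("rural", "girl", 98), ("rural", "girl", 100),
]

def group_means(records, key):
    """Mean IQ for each group, where key(record) names the group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += record[2]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

residence_means = group_means(cases, key=lambda rec: rec[0])        # three means
sex_means = group_means(cases, key=lambda rec: rec[1])              # two means
cell_means = group_means(cases, key=lambda rec: (rec[0], rec[1]))   # six means

for cell, mean in sorted(cell_means.items()):
    print(cell, mean)
```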

This type of breakdown and similar ones where percentages 
instead of means are involved were utilized in psychological research 
long before the advent of the analysis of variance technique. 
The further breakdown of each sex group for residence status (or 
of residence groups for sex) is made in order to see whether rural- 
urban differences hold for the sexes separately (or whether the 
sex differences are similar for each of the separate residence groups). 
Although researchers were not confined to the single variable 
approach before the invention of the variance technique, they 
were definitely limited in the possible statistical treatment of 
their data. Now that we have the analysis of variance method, 
we have an adequate statistical technique for checking such 
hypotheses as can be formulated concerning the influence of not 
only one but two or more variables. The advantages of using 
analysis of variance for such situations may be briefly 
mentioned. 

First, as we have already seen, it provides an over-all test of 
the significance of the difference between two or more means 
when either large or small samples are involved. 

Second, we shall soon see that it leads to a definitely improved 
estimate of sampling error when double or triple or higher-order 
classification is involved. For instance, when the older method 
is used to check the significance of the difference between the two 
sex means for the total group, the determination of the sampling 
error makes no allowance for likely heterogeneity in intelligence 
associated with residence status. The variance method permits a 
refined estimate of error by allowing for variation due to one or 
more variables when one is testing the differences between groups 
classified on the basis of some other variable. 

Third, the variance technique provides a means of testing 
whether the influence of one independent variable on the dependent 
variable is similar for subgroups formed on the basis of a second 
independent variable. In a sex-by-residence analysis of IQ's, 
the breakdown of each residence group by sex will likely show 
that the sex differences are not exactly the same for the three 
groups and that rural-suburban-urban differences are not exactly 
alike for the separate sex groups. Such inconsistencies as seem 
apparent from examination of the six cell means may not be real 
for the simple reason that random sampling errors are present. 
Before the development of the variance technique there was no 
way of testing such apparent inconsistencies, except when each 
classificatory characteristic leads to just two categories. 

This last point has to do with what has been termed interaction, 
a concept which is not easily understood. Rather than provide 
a detailed discussion now of what is meant by interaction, we will 
give a simple illustration. Suppose it has been found that one 
learning method has a distinct advantage over a second method, 
but that, when the data are broken down for two recall intervals, 
the superiority of the first method seems to hold only for those 
with the shorter recall interval. This failure of the first method 
to be consistently better becomes an example of interaction. 
Before concluding that there is evidence for real interaction, one 
needs to apply a statistical test. For such a simple breakdown, 
one could compute the difference between the first and second 
method means, and the standard error of the difference, for those 
with the short recall interval; likewise, for those with the long 
interval; then one could determine the difference between the 
differences and its standard error and therefrom obtain either a 
critical ratio or a $t$ as a test of inconsistency. But, when one thinks 
of a situation with three methods and three or four recall intervals, 
it is immediately obvious that such a simple test cannot be 
applied. 
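
For the two-method, two-interval case, the test of inconsistency can be sketched as follows. The summary figures are hypothetical, and the sketch assumes that the short-interval and long-interval groups are independent, so that the squared standard errors of the two differences simply add.

```python
import math

def difference_of_differences(diff_short, se_short, diff_long, se_long):
    """Critical ratio for the difference between two independent mean differences."""
    dd = diff_short - diff_long
    # Assumes independent groups: variances of the two differences add.
    se_dd = math.sqrt(se_short ** 2 + se_long ** 2)
    return dd, se_dd, dd / se_dd

# Hypothetical figures: method 1 minus method 2, with the standard error
# of each difference, for the short and the long recall interval.
dd, se_dd, critical_ratio = difference_of_differences(6.0, 1.5, 1.0, 2.0)
print(dd, se_dd, critical_ratio)
```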

It is the purpose of this chapter to present the methods of 
analysis to be used when classification into groups is made on the 
basis of two or more variables. These extensions, which are restricted 
by the underlying assumptions of normality and homogeneity 
of variance for the dependent variable, are applicable for 
either large or small samples and are particularly helpful with 
small samples when it seems imperative that we “get the most 
out of the available data.” 

DOUBLE CLASSIFICATION 

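Suppose that the individuals (or their scores) are classifiable 
into $C$ groups on the basis of one characteristic or variable and 
into $R$ groups on the basis of a second variable. This would lead 
to a table with $RC$ cells. Let us first examine the setup where 
we have only $RC$ scores, i.e., one score for each cell. It is convenient 
to let $X_{rc}$ stand for the score in the $r$th row and $c$th column 
of such a table. The score in the first row (from the top) and 
third column would be symbolized as $X_{13}$. The general pattern 
of labeling the scores is set forth in Table 45, which also includes 
along the margins a symbol for the several possible row and column 
means. 

Table 45. Schema for Labeling Scores and Means for Groups, Double 
Classification 

\[
\begin{array}{c|ccccccc|c}
 & 1 & 2 & 3 & \cdots & c & \cdots & C & \text{Mean} \\
\hline
1 & X_{11} & X_{12} & X_{13} & \cdots & X_{1c} & \cdots & X_{1C} & \bar{X}_{1\cdot} \\
2 & X_{21} & X_{22} & X_{23} & \cdots & X_{2c} & \cdots & X_{2C} & \bar{X}_{2\cdot} \\
3 & X_{31} & X_{32} & X_{33} & \cdots & X_{3c} & \cdots & X_{3C} & \bar{X}_{3\cdot} \\
\vdots & & & & & & & & \\
r & X_{r1} & X_{r2} & X_{r3} & \cdots & X_{rc} & \cdots & X_{rC} & \bar{X}_{r\cdot} \\
\vdots & & & & & & & & \\
R & X_{R1} & X_{R2} & X_{R3} & \cdots & X_{Rc} & \cdots & X_{RC} & \bar{X}_{R\cdot} \\
\hline
\text{Mean} & \bar{X}_{\cdot 1} & \bar{X}_{\cdot 2} & \bar{X}_{\cdot 3} & \cdots & \bar{X}_{\cdot c} & \cdots & \bar{X}_{\cdot C} & \bar{X}
\end{array}
\]

Note that the first subscript identifies the row and the 
second the column to which a score belongs. The scheme used 
in denoting means should be grasped. Thus $\bar{X}_{\cdot 2}$ is the mean for 
the second column, whereas $\bar{X}_{2\cdot}$ is the mean for the second row. 
The “dot” in the subscript indicates the direction of the summing 
for computing a mean: to get $\bar{X}_{\cdot 2}$ we sum $X_{r2}$ scores with $r$ 
taking on values running from 1 to $R$. 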








The deviation of any score, $X_{rc}$, from the total mean can be 
expressed in terms of the deviation of its row mean from the total 
mean, $(\bar{X}_{r\cdot} - \bar{X})$, plus the deviation of its column mean from 
the total mean, $(\bar{X}_{\cdot c} - \bar{X})$, plus a sort of remainder term which 
represents an individual variation over and above that due to 
the groups to which the score belongs. To secure an expression 
for this term, we note that by definition the term must be the 
part of the score deviation (from the total mean) left over after 
the sum of the two parts specified above has been subtracted. 
Accordingly, we have 

\[ (X_{rc} - \bar{X}) - [(\bar{X}_{r\cdot} - \bar{X}) + (\bar{X}_{\cdot c} - \bar{X})] \]

which simplifies to 

\[ (X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X}) \]

We may therefore write the following identity: 

\[ (X_{rc} - \bar{X}) = (\bar{X}_{r\cdot} - \bar{X}) + (\bar{X}_{\cdot c} - \bar{X}) + (X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X}) \]

With $r$ running from 1 to $R$, and $c$ taking values from 1 to $C$, 
there will, of course, be $RC$ individual deviations. We need the 
sum of their squares, which sum will involve the squares of the 
three parts, plus three cross-product terms that can be shown to 
vanish when summed. It may be instructive to indicate how the 
sum of squares for all $RC$ cases can be set up. Suppose we begin 
by writing the squares of the deviations for scores in the first 
column. Each of these squares will involve cross-product terms, 
which we shall here ignore except for a plus sign to indicate their 
existence. We have for the first-column scores: 

\[
\begin{aligned}
(X_{11} - \bar{X})^2 &= (\bar{X}_{1\cdot} - \bar{X})^2 + (\bar{X}_{\cdot 1} - \bar{X})^2 + (X_{11} - \bar{X}_{1\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 + \cdots \\
(X_{21} - \bar{X})^2 &= (\bar{X}_{2\cdot} - \bar{X})^2 + (\bar{X}_{\cdot 1} - \bar{X})^2 + (X_{21} - \bar{X}_{2\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 + \cdots \\
(X_{r1} - \bar{X})^2 &= (\bar{X}_{r\cdot} - \bar{X})^2 + (\bar{X}_{\cdot 1} - \bar{X})^2 + (X_{r1} - \bar{X}_{r\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 + \cdots \\
(X_{R1} - \bar{X})^2 &= (\bar{X}_{R\cdot} - \bar{X})^2 + (\bar{X}_{\cdot 1} - \bar{X})^2 + (X_{R1} - \bar{X}_{R\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 + \cdots
\end{aligned}
\]

The summing of these squares of deviations for scores of column 1 
involves $R$ cases, i.e., $r$ runs from 1 to $R$; hence we need a symbol 
which denotes this fact. Let us use $\sum_r$ for this purpose. Note 
that the second term on the right is constant for all $R$ cases, which 
permits us to replace the summation sign by $R$. 




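The sum of the first column squares, and by analogy the sums 
for the other columns, can be written as: 

\[
\begin{aligned}
\text{1st col.:}\quad \sum_r (X_{r1} - \bar{X})^2 &= \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 + R(\bar{X}_{\cdot 1} - \bar{X})^2 + \sum_r (X_{r1} - \bar{X}_{r\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 \\
\text{2nd col.:}\quad \sum_r (X_{r2} - \bar{X})^2 &= \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 + R(\bar{X}_{\cdot 2} - \bar{X})^2 + \sum_r (X_{r2} - \bar{X}_{r\cdot} - \bar{X}_{\cdot 2} + \bar{X})^2 \\
\text{$c$th col.:}\quad \sum_r (X_{rc} - \bar{X})^2 &= \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 + R(\bar{X}_{\cdot c} - \bar{X})^2 + \sum_r (X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X})^2 \\
\text{$C$th col.:}\quad \sum_r (X_{rC} - \bar{X})^2 &= \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 + R(\bar{X}_{\cdot C} - \bar{X})^2 + \sum_r (X_{rC} - \bar{X}_{r\cdot} - \bar{X}_{\cdot C} + \bar{X})^2
\end{aligned}
\]

We may now sum over the $C$ columns, and for the results we will 
need double summation signs. Since the first right-hand term 
does not vary from column to column, its sum is merely $C$ times 
its value. The second right-hand set of terms involves a constant 
times a variable; hence the constant $R$ comes from under the 
summation sign. Finally we have the following expression for 
the sum of squares for the $RC$ scores: 

\[ \sum_r \sum_c (X_{rc} - \bar{X})^2 = C \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 + R \sum_c (\bar{X}_{\cdot c} - \bar{X})^2 + \sum_r \sum_c (X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X})^2 \tag{101} \]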
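
The partition of the total sum of squares into a row part, a column part, and a remainder can be checked numerically. The sketch below uses a small hypothetical table of scores (three rows, four columns, one score per cell); the variable names are ours, not the text's.

```python
# Hypothetical 3 x 4 table of scores, one score per cell.
scores = [
    [12, 15, 11, 18],
    [9, 14, 10, 15],
    [14, 19, 12, 21],
]
R, C = len(scores), len(scores[0])

grand_mean = sum(sum(row) for row in scores) / (R * C)
row_means = [sum(row) / C for row in scores]
col_means = [sum(scores[r][c] for r in range(R)) / R for c in range(C)]

# Left-hand side: total sum of squares about the total mean.
ss_total = sum((scores[r][c] - grand_mean) ** 2
               for r in range(R) for c in range(C))

# Right-hand side: rows, columns, and remainder.
ss_rows = C * sum((m - grand_mean) ** 2 for m in row_means)
ss_cols = R * sum((m - grand_mean) ** 2 for m in col_means)
ss_remainder = sum(
    (scores[r][c] - row_means[r] - col_means[c] + grand_mean) ** 2
    for r in range(R) for c in range(C)
)

print(ss_total, ss_rows + ss_cols + ss_remainder)
```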

The reader who is worried about whether the cross-product 
terms really vanish should note that for the $c$th column the product 
term 

\[ \sum_r (\bar{X}_{r\cdot} - \bar{X})(\bar{X}_{\cdot c} - \bar{X}) = (\bar{X}_{\cdot c} - \bar{X}) \sum_r (\bar{X}_{r\cdot} - \bar{X}) \]

vanishes because $\sum_r (\bar{X}_{r\cdot} - \bar{X}) = 0$. The other two cross-product 
sums have as one factor the remainder or residual term; we have 
already had examples of a general principle that product terms 
involving residuals vanish. 
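
A reader who prefers a numerical check to the algebra may sum the three cross-product terms directly. The table of scores below is hypothetical; each of the three sums should come out zero apart from rounding error.

```python
# Hypothetical 3 x 4 table of scores, one score per cell.
scores = [
    [12, 15, 11, 18],
    [9, 14, 10, 15],
    [14, 19, 12, 21],
]
R, C = len(scores), len(scores[0])
cells = [(r, c) for r in range(R) for c in range(C)]

grand = sum(sum(row) for row in scores) / (R * C)
row_dev = [sum(row) / C - grand for row in scores]
col_dev = [sum(scores[r][c] for r in range(R)) / R - grand for c in range(C)]

# Residual: score deviation left after removing row and column parts.
resid = {(r, c): (scores[r][c] - grand) - row_dev[r] - col_dev[c]
         for r, c in cells}

cross_row_col = sum(row_dev[r] * col_dev[c] for r, c in cells)
cross_row_resid = sum(row_dev[r] * resid[r, c] for r, c in cells)
cross_col_resid = sum(col_dev[c] * resid[r, c] for r, c in cells)

print(cross_row_col, cross_row_resid, cross_col_resid)
```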

From formula (101) we see that the total sum of squares can 
be broken into three additive components: between row means 
with $R - 1$ degrees of freedom, between column means with df 
of $C - 1$, and a remainder. The degrees of freedom for the last 
part can be ascertained by a principle analogous to that used for 
getting the df for contingency tables. The marginal means 
constitute restrictions on the deviation score entries in the rows 
and columns: when deviation scores for $(R - 1)(C - 1)$ cells 
are filled in, the rest of the entries become fixed; hence df 
$= (R - 1)(C - 1)$. Note that the df's for the three parts sum 
to the df for the total sum of squares, or $RC - 1$. 

Dividing the three sums of squares by their df's leads to three 
variance estimates, $s_r^2$ for that based on rows, $s_c^2$ for columns, 
and $s_e^2$ for that based on the remainder, sometimes called error, 
sum of squares. If the rows and columns represent groups drawn 
from the same population, these three estimates are unbiased 
estimates of the population variance. We have two null hypotheses: 
that the row means are chance variations from one population 
mean, and that the column means are also variations from 
one population mean. As in the simpler situation, if the estimate 
based on rows is larger than expected on the basis of chance, it 
follows that there are real differences between the population 
means for the groups defined by the rows; likewise, for column 
means. 

In testing the significance of the between-groups variance, we 
use the estimate based on the remainder as the denominator of 
the $F$ ratio. For testing the variation of row means, we have 
$F = s_r^2/s_e^2$ with $n_1 = R - 1$ and $n_2 = (R - 1)(C - 1)$. For 
column means, $F = s_c^2/s_e^2$ with $n_1 = C - 1$ and $n_2 
= (R - 1)(C - 1)$. If an $F$ so defined happens to be less than 
unity, we know at once without reference to the table for $F$ that 
the variations of the given means are insignificant. Note that, 
since the error variance used in the denominator is a residual after 
the parts of the total associated with between-row and between-column 
variations have been subtracted, it follows that we are 
using as our error term a variance which has been freed of the 
influence of heterogeneity with respect to the two classificatory 
variables being investigated. 
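
The two F ratios and their degrees of freedom can be computed as in the following sketch. The function `two_way_f` and the table of scores are hypothetical; the F values would still be referred to a table of F for their significance.

```python
def two_way_f(scores):
    """F ratios for rows and columns, one score per cell, remainder as error."""
    R, C = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (R * C)
    row_means = [sum(row) / C for row in scores]
    col_means = [sum(row[c] for row in scores) / R for c in range(C)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = C * sum((m - grand) ** 2 for m in row_means)
    ss_cols = R * sum((m - grand) ** 2 for m in col_means)
    ss_rem = ss_total - ss_rows - ss_cols

    df_rows, df_cols, df_rem = R - 1, C - 1, (R - 1) * (C - 1)
    s2_r, s2_c, s2_e = ss_rows / df_rows, ss_cols / df_cols, ss_rem / df_rem
    return {
        "F_rows": s2_r / s2_e, "df_rows": (df_rows, df_rem),
        "F_cols": s2_c / s2_e, "df_cols": (df_cols, df_rem),
    }

# Hypothetical 3 x 4 table of scores, one score per cell.
result = two_way_f([
    [12, 15, 11, 18],
    [9, 14, 10, 15],
    [14, 19, 12, 21],
])
print(result)
```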

For many situations involving double classification, it would 
seem that the method just outlined would be definitely limited 
in usefulness because no provision has been made for increasing 
the size of the sample except by using finer grouping on one or 
both of the independent variables. Finer grouping would be 
possible, though not always feasible or desirable, for some classificatory 
variables, such as degree of illumination or amount of 
education or size of type, but for other bases for forming groups 
there are definite limits on the number of groups. For example, 
in the study of reaction time the number of possible groupings for 
sense modality is limited. Actually, the number of cases can be 
increased by having additional individuals assigned to each of the 
RC cells. Before taking up this needed modification of the setup, 
we shall discuss certain specific situations where the scheme as 
presented is of practical use. We are not ignoring the possibility 
that sometimes RC cases are enough for testing hypotheses even 
when both R and C are as small as 4 or 5. 


SIGNIFICANCE OF THE DIFFERENCES BETWEEN CORRELATED MEANS 


Suppose that the $RC$ scores are for $R$ individuals working under 
$C$ different conditions. The mean of a row would be for an individual, 
and the mean of a column would be for a specified condition. 
Let us consider the limiting case of $C = 2$. The between-columns 
sum of squares, $R \sum_c (\bar{X}_{\cdot c} - \bar{X})^2$, may be written as 

\[ R(\bar{X}_{\cdot 1} - \bar{X})^2 + R(\bar{X}_{\cdot 2} - \bar{X})^2 \]

which we have already shown (p. 246) reduces to $(R/2)(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})^2$, 
or to a function of the difference between the two means. 
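
For the reader without the earlier page at hand, the reduction can be retraced in a few lines. This is only a sketch of the step, using the fact that with one score per cell the total mean is the average of the two column means.

```latex
\begin{aligned}
\bar{X} &= \tfrac{1}{2}(\bar{X}_{\cdot 1} + \bar{X}_{\cdot 2})
  \quad\Longrightarrow\quad
  \bar{X}_{\cdot 1} - \bar{X} = \tfrac{1}{2}(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2}),
  \qquad
  \bar{X}_{\cdot 2} - \bar{X} = -\tfrac{1}{2}(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2}) \\
R(\bar{X}_{\cdot 1} - \bar{X})^2 + R(\bar{X}_{\cdot 2} - \bar{X})^2
  &= 2R \cdot \tfrac{1}{4}(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})^2
   = \tfrac{R}{2}(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})^2
\end{aligned}
```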

Let us next examine the remainder or error term. If we turn 
back to p. 272, where we summed over columns, we readily see 
that the remainder sum can be expressed as 

\[ \sum_r (X_{r1} - \bar{X}_{r\cdot} - \bar{X}_{\cdot 1} + \bar{X})^2 + \sum_r (X_{r2} - \bar{X}_{r\cdot} - \bar{X}_{\cdot 2} + \bar{X})^2 \]

in which the $c$ of formula (101) has the explicit values of 1 and 2. 
Now the mean of any row, say the $r$th, is merely the mean of 
$C = 2$ scores; i.e., $\bar{X}_{r\cdot} = (X_{r1} + X_{r2})/2$, and the total mean must 
be the average of the two column means, or $\bar{X} = (\bar{X}_{\cdot 1} + \bar{X}_{\cdot 2})/2$. 
Making these substitutions, we have 

\[ \sum_r \left( X_{r1} - \frac{X_{r1} + X_{r2}}{2} - \bar{X}_{\cdot 1} + \frac{\bar{X}_{\cdot 1} + \bar{X}_{\cdot 2}}{2} \right)^2 + \sum_r \left( X_{r2} - \frac{X_{r1} + X_{r2}}{2} - \bar{X}_{\cdot 2} + \frac{\bar{X}_{\cdot 1} + \bar{X}_{\cdot 2}}{2} \right)^2 \]

which simplifies to 

\[ \tfrac{1}{4} \sum_r (X_{r1} - X_{r2} - \bar{X}_{\cdot 1} + \bar{X}_{\cdot 2})^2 + \tfrac{1}{4} \sum_r (X_{r2} - X_{r1} - \bar{X}_{\cdot 2} + \bar{X}_{\cdot 1})^2 \]

These two terms become identical when we change the signs 
within the second parentheses, which change is permissible since 
the square of a function is the same as the square of its negative. 
Hence we have 

\[ \tfrac{1}{2} \sum_r [(X_{r1} - X_{r2}) - (\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})]^2 \]

Now the first parentheses term is the difference between any 
individual's two scores, say $D_r$, and the second is the difference 
between the two column means, which difference it will be recalled 
is the same as the mean of the differences, $\bar{D}$. We have finally 
the remainder sum of squares as $\tfrac{1}{2} \sum_r (D_r - \bar{D})^2$, or one-half the 
sum of the squares of the difference scores about the mean difference. 

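The $F$ for comparing two column means becomes 

\[ F = \frac{\dfrac{(R/2)(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})^2}{1}}{\dfrac{\tfrac{1}{2} \sum_r (D_r - \bar{D})^2}{R - 1}} \]

with $n_1 = 1$ and $n_2 = R - 1$. This reduces to 

\[ F = \frac{(\bar{X}_{\cdot 1} - \bar{X}_{\cdot 2})^2}{\dfrac{\sum_r (D_r - \bar{D})^2}{R(R - 1)}} \]

which the reader will recognize as $t^2$ for comparing the difference 
between means based on sets of correlated scores with the standard 
error of the mean difference estimated by formula (92). 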
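
The equivalence of this F with the square of t for correlated means can be verified on a small example. The scores below are hypothetical (six individuals under two conditions), and the variable names are ours.

```python
import math

# Hypothetical scores for R = 6 individuals under C = 2 conditions.
cond1 = [14, 11, 17, 12, 15, 13]
cond2 = [16, 12, 20, 13, 19, 14]
R = len(cond1)

diffs = [a - b for a, b in zip(cond1, cond2)]
mean_diff = sum(diffs) / R          # equals the difference of column means
ss_diff = sum((d - mean_diff) ** 2 for d in diffs)

# Analysis of variance route: between columns over remainder.
ss_columns = (R / 2) * mean_diff ** 2     # df = 1
ss_remainder = 0.5 * ss_diff              # df = R - 1
f_ratio = (ss_columns / 1) / (ss_remainder / (R - 1))

# Difference-score route: mean difference over its standard error.
se_mean_diff = math.sqrt(ss_diff / (R * (R - 1)))
t_ratio = mean_diff / se_mean_diff

print(f_ratio, t_ratio ** 2)
```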

We have seen in Chapter 5 that in testing the difference between 
the means of correlated scores we can, for the large sample 
situation, determine the needed sampling error either from the 
distribution of differences between paired scores or by means of 
the standard error of the difference formula with the correlational 
term included. The important thing to note is that the analysis 
of variance technique provides a method for testing the significance 
of the difference between two or more means based on sets 
of correlated scores. The scores may be correlated either because 
they are based on the same individuals working under $C$ conditions 
or having $C$ trials on some stunt, or because siblings or litter 
mates are involved (each of the $C$ groups containing one case from 
each of $R$ families), or because we started with $R$ sets of matched 
individuals, one from each set being assigned to the several $C$ 
groups. After and only after it has been found that the $F$ for the 
$C$ column means is significant are we justified in using the critical 
ratio or $t$ technique to test the significance of the difference between 
any two of the $C$ means. 

The F just discussed has to do with column means. What of 
the row means for the given setup? The means of the R rows 
represent the mean performance of each of the several individuals, 
and a test of the significance of the estimate of variance based on 
the between-row sum of squares becomes a test of the significance 
of individual differences. Since it is known that individuals do 
differ on practically all psychological variables, such a test is 
usually a trivial test of the obvious, and hence it is seldom needed. 
We may, however, have the situation in which we wonder whether 
individual variation is significant in the light of known measurement 
or response errors. To this question we now turn. 


RELIABILITY OF MEASUREMENT 

Suppose the scores in each row represent either the performance 
of an individual on different forms of a scale or $C$ measurements 
for a given variable. The column means would be the 
means for the forms or successive sets of measurements, and the 
test of the significance between column means would be a test of 
the difference between the several form means or of the difference 
between the means for the $C$ successive sets of trials. For form 
means or for trial means, $F = s_c^2/s_e^2$, as outlined above, provides 
an over-all test of the significance of these correlated means. 

In order better to understand the meaning, in this situation, of 
an $F$ based on $s_r^2$, let us again take the limiting case of $C = 2$; 
e.g., suppose two forms of a test have been administered to $R$ 
individuals. The algebra is simplified and an interesting, clear-cut 
result emerges if we assume that the two forms yield exactly 
the same means, i.e., that $\bar{X}_{\cdot 1} = \bar{X}_{\cdot 2} = \bar{X}$. Then the remainder 
sum of squares, 

\[ \sum_r \sum_c (X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X})^2 \]

becomes 

\[ \sum_r \sum_c (X_{rc} - \bar{X}_{r\cdot})^2 \]

This can be written without the double summation sign as 

\[ \sum_r (X_{r1} - \bar{X}_{r\cdot})^2 + \sum_r (X_{r2} - \bar{X}_{r\cdot})^2 \]

Since the mean of each row is simply the average of two scores, i.e., 

\[ \bar{X}_{r\cdot} = \frac{X_{r1} + X_{r2}}{2} \]

the above can be written as 

\[ \sum_r \left( X_{r1} - \frac{X_{r1} + X_{r2}}{2} \right)^2 + \sum_r \left( X_{r2} - \frac{X_{r1} + X_{r2}}{2} \right)^2 \]

which by a little algebraic manipulation reduces to 

\[ \tfrac{1}{2} \sum_r (X_{r1} - X_{r2})^2 \]

Since we have assumed that the form means are equal, the difference 
scores in this expression will have a mean of zero. Therefore, 
if we divide the sum of the squared differences by $R$, the 
number of individuals, we will have the variance of the distribution 
of differences, which we symbolize by $\sigma_D^2$. 

It follows that 

\[ \tfrac{1}{2} \sum_r (X_{r1} - X_{r2})^2 = \frac{R}{2} \sigma_D^2 \]


Now it can be shown by easy algebra (see pp. 73-74) that 

\[ \sigma_D^2 = \sigma_1^2 + \sigma_2^2 - 2 r_{12} \sigma_1 \sigma_2 \]

in which the $\sigma$'s are measures of variation for forms 1 and 2 respectively, 
and $r_{12}$ is the correlation between forms. If we make the 
usual assumption that the two forms are so nearly comparable 
that we can replace $\sigma_1$ and $\sigma_2$ by $\sigma$, we have 

\[ \sigma_D^2 = \sigma^2 + \sigma^2 - 2 r_{12} \sigma \sigma = 2\sigma^2 (1 - r_{12}) \]

Then $(R/2)\sigma_D^2$ becomes $R\sigma^2(1 - r_{12})$. But $r_{12}$ defines, and is, 
the reliability coefficient, and hence $\sigma^2(1 - r_{12})$ is the error of 
measurement variance, $\sigma_e^2$, so that we finally have the remainder 
sum of squares equal to $R\sigma_e^2$. 
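
The result can be checked on a small hypothetical example in which the two forms have exactly equal means and equal variances (the second form is merely the first with some scores interchanged). Variances are computed with divisor R, as in the derivation; the names below are ours.

```python
import math

# Hypothetical scores of R = 5 individuals on two forms; form 2 holds the
# same five values in a different order, so means and variances agree.
form1 = [1, 2, 3, 4, 5]
form2 = [2, 1, 4, 3, 5]
R = len(form1)

mean1, mean2 = sum(form1) / R, sum(form2) / R
grand = (mean1 + mean2) / 2

# Remainder sum of squares from its definition.
ss_remainder = 0.0
for x1, x2 in zip(form1, form2):
    row_mean = (x1 + x2) / 2
    ss_remainder += (x1 - row_mean - mean1 + grand) ** 2
    ss_remainder += (x2 - row_mean - mean2 + grand) ** 2

# Variance (divisor R) and correlation between forms.
var1 = sum((x - mean1) ** 2 for x in form1) / R
var2 = sum((x - mean2) ** 2 for x in form2) / R
cov = sum((a - mean1) * (b - mean2) for a, b in zip(form1, form2)) / R
r12 = cov / math.sqrt(var1 * var2)

# R times the error of measurement variance.
r_sigma_e_sq = R * var1 * (1 - r12)
print(ss_remainder, r_sigma_e_sq)
```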

Thus, under our simplifying assumptions of equal form means 
and equal form variances, assumptions which are usually made 
in connection with test reliability, we see that the remainder term 
is directly associated with the familiar error of measurement 
variance. The remainder term as actually computed from the 
sum of squares includes an adjustment for possibly differing form 
means but no allowance for differing form variances, so it will 
not exactly equal $R\sigma_e^2$. The remainder sum of squares does, 
however, lead to an estimate of the error of measurement variance, 
not only in the situation where we have an analysis based 
on two forms but also where three or more forms are involved; 
accordingly, when we test the significance of the variance for 
between-row means, we are actually asking whether the individual 
differences are significant in light of the variability due to measurement 
errors. 

Since the reliability coefficient is a function of the error variance 
relative to the observed trait variance, it follows that a significant 
between-individuals variance is evidence for statistically significant 
reliability. But one cannot conclude from this that the test 
or instrument possesses satisfactory reliability since coefficients 
as low as .20 or .30 or even .10 can be statistically different from 
zero if $R$ is sufficiently large. The author does not recommend 
this approach to the question of the reliability of measurement 
for the simple reason that it is more important to know how reliable 
a test is or how near its reliability approaches unity than to know 
only that it is reliable in the sense of yielding a coefficient significantly 
different from zero. 

This possible application of the variance technique, however, 
points up the fact that it is sometimes meaningful to speak of the 
remainder variance as “error” variance. In a wider sense, the 
remainder variance can be thought of as the uncontrolled variation 
which contributes to the variation of the means of the groups 
being compared. Now a little reflection leads one to the conclusion 
that the sources of error in research are many and varied. Sometimes 
instrumental and/or measurement errors loom large, sometimes 
the error associated with the sampling of individuals is 
paramount, at other times the intraindividual variation is sizable, 
and frequently if the sources of variation are unknown the term 
experimental error is used as a catchall. When a particular variance 
estimate is referred to as the error variance to be used as the 
denominator of the $F$ ratio, the “error” may be any one of or a 
combination of the many types of error. In this sense, the variance 
estimate based on the remainder sum of squares may be the error 
variance even for those situations where we have classifications 
into $R$ groups rather than as $R$ individuals, but as will presently 
be seen the term which we are now calling the remainder may not 
always be the one to utilize as “error.” The within-groups variance 
estimate of the last chapter was an “error” variance for 
testing the significance of the between-groups variation. In more 
complex setups in the analysis of variance, judgment is required 
in choosing the appropriate error term. 

Parenthetically, it might be pointed out that the test reliability 
problem can be tackled by the within- and between-groups variance 
estimates. Each person for whom we have two or more, say $C$, 
measurements yields a set of $C$ scores, and the variation within 
such a set is partly a function of measurement errors; hence the 
over-all within-groups (intraindividual) variance estimate becomes 
an error term by which one may test the significance of the 
between-groups (between-individual) variance. Note that this 
within-groups or intraindividual approach will lead to an estimate 
of the error of measurement variance without an adjustment for 
possible differences in form means, and that it does not permit a 
test of the significance of the difference between form means, 
which is possible when the double classification scheme is utilized. 
Either of the two methods for determining whether the reliability 
is sufficient to possess statistical significance is applicable for an 
over-all evaluation of $C$ forms or $C$ successive measurements or 
trials. With $C$ forms and $R$ individuals, it is of interest to make a 
comparative layout of the two approaches, that based on the 
double classification scheme of this chapter and that based on the 
single classification procedure of the last chapter. Table 46 contains 
the essentials. 

Table 46. Two Approaches to Test Reliability Problem 

\[
\begin{array}{cc|cc}
\multicolumn{2}{c|}{\text{Via Double Classification}} & \multicolumn{2}{c}{\text{Via Single Classification}} \\
\text{Variance Estimate} & df & \text{Variance Estimate} & df \\
\hline
s_r^2 & R - 1 & s_b^2 & R - 1 \\
s_c^2 & C - 1 & & \\
s_e^2 & (R - 1)(C - 1) & s_w^2 & R(C - 1)
\end{array}
\]

Note that both $F = s_r^2/s_e^2$ and $F = s_b^2/s_w^2$ (the between-individuals 
over the within-individuals estimate) provide tests of 
the significance of reliability by way of the significance of individual 
differences. The df for the $s_e^2$ estimate is $C - 1$ smaller 
than that for $s_w^2$, a trivial difference in the practical situation 
where $C$ is seldom more than 2 or 3, and $R$ is usually 100 or more, 
rarely as small as 25 or 50. Both $s_e^2$ and $s_w^2$ constitute estimates 
of the error of measurement variance, but $s_e^2$, because of the adjustment 
for differing form means, will be smaller than $s_w^2$. Whether 
either of these estimates is useful as indicating precisely the measurement 
error for a particular form depends upon the extent to 
which the standard deviations for the several forms are similar. 
Either error variance may be used to estimate the magnitude of 
an over-all (forms) reliability coefficient. The estimates for an 
$r$ would be $(1 - s_e^2/s_t^2)$ and $(1 - s_w^2/s_t^2)$, where $s_t^2$, the variance 
estimate based on the total sum of squares with $RC - 1$ degrees 
of freedom, is taken as the best estimate of the trait variation as 
measured by a single form or a single set of measurements or just 
one trial. 
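
Both estimates of an over-all reliability coefficient, one minus the ratio of an error variance estimate to the total variance estimate, can be sketched as follows. The scores are hypothetical (six individuals, two forms), and the names are ours rather than the text's.

```python
# Hypothetical scores: R = 6 individuals (rows) on C = 2 forms (columns).
scores = [
    [10, 12],
    [14, 15],
    [9, 10],
    [16, 18],
    [12, 12],
    [11, 13],
]
R, C = len(scores), len(scores[0])

grand = sum(sum(row) for row in scores) / (R * C)
row_means = [sum(row) / C for row in scores]
col_means = [sum(row[c] for row in scores) / R for c in range(C)]

ss_total = sum((x - grand) ** 2 for row in scores for x in row)
ss_rows = C * sum((m - grand) ** 2 for m in row_means)
ss_cols = R * sum((m - grand) ** 2 for m in col_means)

# Double classification: remainder freed of form-mean differences.
s2_e = (ss_total - ss_rows - ss_cols) / ((R - 1) * (C - 1))
# Single classification: within-individuals estimate.
s2_w = (ss_total - ss_rows) / (R * (C - 1))
# Total variance estimate with RC - 1 degrees of freedom.
s2_t = ss_total / (R * C - 1)

r_double = 1 - s2_e / s2_t
r_single = 1 - s2_w / s2_t
print(round(r_double, 3), round(r_single, 3))
```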


COMPUTATIONAL ILLUSTRATION 

The required computations for testing variation between column 
means and between row means will now be set forth. It makes no 
difference in the computational procedure whether we have RC 
individuals classified into R groups one way and C groups another 
way or R individuals with C scores each or R sets of C individuals 
matched or RC scores for just one individual. Our illustrative 
example will be of the last-mentioned type. 















The computation of the required sums of squares involves an 
extension of formulas (97), as follows: 

\[ \sum_r \sum_c (X_{rc} - \bar{X})^2 = \frac{1}{RC} \Big[ RC \sum_r \sum_c X_{rc}^2 - \Big( \sum_r \sum_c X_{rc} \Big)^2 \Big] \quad \text{for total} \tag{102a} \]

\[ R \sum_c (\bar{X}_{\cdot c} - \bar{X})^2 = \frac{1}{RC} \Big[ C \sum_c \Big( \sum_r X_{rc} \Big)^2 - \Big( \sum_r \sum_c X_{rc} \Big)^2 \Big] \quad \text{for columns} \tag{102b} \]

\[ C \sum_r (\bar{X}_{r\cdot} - \bar{X})^2 = \frac{1}{RC} \Big[ R \sum_r \Big( \sum_c X_{rc} \Big)^2 - \Big( \sum_r \sum_c X_{rc} \Big)^2 \Big] \quad \text{for rows} \tag{102c} \]

The sum of squares for the remainder can be obtained by subtracting 
the sums for between columns and for between rows 
from the total sum of squares. Formulas (102) may look forbidding 
at first, but actually the sums based on raw scores are easily 
secured by following a plan on the work sheet. Sum each row, 
and write the sums on the right-hand margin; sum each column, 
and write the sums along the bottom margin. Summing down 
the right-hand margin gives the total sum, and summing across 
the bottom margin should give the same total sum. Square all 
scores and sum to get the first sum in (102a); square all the bottom 
margin sums and then sum to get the first part of (102b); 
square all the right-hand margin sums and then sum to get the first 
part of (102c). 

Table 47. Data for Judged Whiteness as Related to Level of Illumination 
and Albedo * 

Level of                          Albedo
Illumination      .07      .14      .26      .54    Row sum   Row mean
1.20               11       24       60       78       173      43.25
2.00               14       24       65       79       182      45.50
3.32               24       55       68       80       227      56.75
5.51               35       48       70       80       233      58.25
9.15               12       35       61       79       187      46.75
Column sum         96      186      324      396      1002
Column mean     19.20    37.20    64.80    79.20                50.10

$\sum_r \sum_c X_{rc} = 1002$; $\sum_r \sum_c X_{rc}^2 = 62{,}404$; 
$\sum_c (\sum_r X_{rc})^2 = 305{,}604$; $\sum_r (\sum_c X_{rc})^2 = 203{,}840$ 

* Data from R. E. Taubman, J. Exp. Psychol., 1945, 36, 235-241. 

The student may do well to sit down at a calculator and perform 
these operations with the scores of Table 47, which contains data 
on judged whiteness for 5 (= R) levels of illumination and 4 (= C) 
degrees of albedo. Only one observer is involved, and the score 
in each cell is the average for 10 judgments. Casual examination 
of the table would indicate that judgment is influenced by albedo 
and level of illumination. Do the means for the 5 levels of illumination 
differ significantly among themselves? Do the means 
for the 4 degrees of albedo differ? 

The required sums of raw scores are also included in the table. 
Substituting these in the above formulas gives: 

\[ \tfrac{1}{20} [20(62{,}404) - (1002)^2] = 12{,}203.80 \quad \text{for the total sum of squares} \]

\[ \tfrac{1}{20} [4(305{,}604) - (1002)^2] = 10{,}920.60 \quad \text{for between-columns squares} \]

\[ \tfrac{1}{20} [5(203{,}840) - (1002)^2] = 759.80 \quad \text{for between-rows squares} \]

Subtracting the sum of the last two from the total gives 523.40 
as the remainder sum of squares. 
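
The work-sheet plan can be followed step by step in a short program. The scores are those of Table 47; the variable names are ours, and the sketch simply applies the raw-score formulas to the marginal sums.

```python
# Scores of Table 47: rows are the 5 levels of illumination,
# columns are the 4 degrees of albedo.
scores = [
    [11, 24, 60, 78],
    [14, 24, 65, 79],
    [24, 55, 68, 80],
    [35, 48, 70, 80],
    [12, 35, 61, 79],
]
R, C = len(scores), len(scores[0])

row_sums = [sum(row) for row in scores]                      # right-hand margin
col_sums = [sum(row[c] for row in scores) for c in range(C)]  # bottom margin
grand_sum = sum(row_sums)
sum_sq_scores = sum(x * x for row in scores for x in row)

# Raw-score formulas for total, between columns, and between rows.
ss_total = (R * C * sum_sq_scores - grand_sum ** 2) / (R * C)
ss_columns = (C * sum(t * t for t in col_sums) - grand_sum ** 2) / (R * C)
ss_rows = (R * sum(t * t for t in row_sums) - grand_sum ** 2) / (R * C)
ss_remainder = ss_total - ss_columns - ss_rows

print(ss_total, ss_columns, ss_rows, ss_remainder)
```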

These results are assembled in Table 48 along with the df's 
and the variance estimates. The $F$ ratios are also included. 

Table 48. Variance Table for Data on Judgment of Whiteness 

Source                   Sum of Squares     df    Variance Estimate
Albedo (columns)            10,920.60        3         3,640.20
Illumination (rows)            759.80        4           189.95
Remainder                      523.40       12            43.62
Total                       12,203.80       19

Albedo $F = 3{,}640.20/43.62 = 83.45$; $n_1 = 3$, $n_2 = 12$; $P < .001$ 
Illumination $F = 189.95/43.62 = 4.35$; $n_1 = 4$, $n_2 = 12$; $P = .03$ 

The influence of albedo is, as one would expect, highly significant, the 
$F$ of 83.45 being far beyond the $F$ of 8.28 required for the $P = .001$ 
level of significance. The $F$ for illumination falls near the .03 
level of significance. 
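
The variance estimates and F ratios follow from the sums of squares and their degrees of freedom, as the sketch below shows. It rounds each variance estimate to two decimals before dividing, as a hand computation from the table would; carrying the unrounded remainder estimate instead moves the albedo F to about 83.46, a difference of no practical consequence. The names are ours.

```python
# Sums of squares from the whiteness data, with R = 5 rows and C = 4 columns.
R, C = 5, 4
ss = {"albedo": 10920.60, "illumination": 759.80, "remainder": 523.40}
df = {"albedo": C - 1, "illumination": R - 1, "remainder": (R - 1) * (C - 1)}

# Variance estimates, rounded to two decimals as in a printed table.
variance = {source: round(ss[source] / df[source], 2) for source in ss}

f_albedo = variance["albedo"] / variance["remainder"]
f_illumination = variance["illumination"] / variance["remainder"]

# Same ratio for albedo without rounding the variance estimates first.
f_albedo_unrounded = (ss["albedo"] / df["albedo"]) / (ss["remainder"] / df["remainder"])

print(variance)
print(round(f_albedo, 2), round(f_illumination, 2), round(f_albedo_unrounded, 2))
```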

Some readers may wonder just what use has been made of the 
fact that each “score” in the above setup is actually based on 
10 judgments. Was there any advantage in having 10 judgments 
instead of 1 judgment for each combination of conditions? A real 
gain is present: the 20 “scores” being analyzed are more reliable 
or stable because each is itself an average. This fact means that 
the “experimental error” involved in the remainder has been 
somewhat, perhaps considerably, reduced. We could, moreover, 
make definite use of the 10 judgments about the cell averages if 
we wished to answer the question of whether the variation in 
judged whiteness associated with degree of illumination is of the 
same order for the various levels of albedo. This is an example 
of the problem of interaction, to which we shall presently return. 

DOUBLE CLASSIFICATION WITH MORE THAN ONE SCORE PER CELL 

Suppose that we have $m$ scores in each cell of schematic Table 
45. This would lead to a mean for each cell, and about each such 
mean we would have the variation of $m$ scores. The mean for 
the $r$th row would be the mean of all $mC$ scores in the row, or the 
mean of the $C$ cell means of the row; the mean of the $c$th column 
would be the mean of the $mR$ scores in the column, or the mean 
of the cell means in the column; in the remainder term, previously 
defined as $(X_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X})$, we would replace $X_{rc}$ by 
$\bar{X}_{rc}$. The total sum of squares for all $mRC$ scores would include 
a between column, a between row, and a remainder component, 
plus an additional part which would involve the variation within 
cells about the cell means. A convenient label for this new part 
would be $\sum_r \sum_c (X_{rc} - \bar{X}_{rc})^2$, in which it is understood that there are 
$m$ such deviations in each cell. A more precise notation would be 
$\sum_i \sum_r \sum_c (X_{irc} - \bar{X}_{rc})^2$, in which $X_{irc}$ is the $i$th score in the cell involving 
the $r$th row and $c$th column. 

The variance table would take on the form indicated in Table 
49, in which the term “remainder” has been replaced by 
“interaction.” Note that the first two sums of squares are simply m 
times the corresponding sums for one score per cell, and that the 
df's for these sums and for the one corresponding to the remainder 
sum are not changed. The df for the within-cells sum depends 
upon the fact that there are $m - 1$ degrees of freedom in each of 
the RC cells, which gives $RC(m - 1) = mRC - RC$ as the df. 
We now have, on the basis of the null hypothesis, four estimates, 
$s_r^2$, $s_c^2$, $s_i^2$, and $s_w^2$, of the same population variance. 

Table 49. Variance Schema for Double Classification with m Scores per Cell 

| Source | Sum of Squares | df | Variance Estimate |
|---|---|---|---|
| Rows | $mC \sum_r (\bar{X}_{r\cdot} - \bar{X})^2$ | $R - 1$ | $s_r^2$ |
| Columns | $mR \sum_c (\bar{X}_{\cdot c} - \bar{X})^2$ | $C - 1$ | $s_c^2$ |
| Interaction | $m \sum_r \sum_c (\bar{X}_{rc} - \bar{X}_{r\cdot} - \bar{X}_{\cdot c} + \bar{X})^2$ | $(R - 1)(C - 1)$ | $s_i^2$ |
| Within cells | $\sum\sum (X_{rc} - \bar{X}_{rc})^2$ | $mRC - RC$ | $s_w^2$ |
| Total | $\sum\sum (X_{rc} - \bar{X})^2$ | $mRC - 1$ | |

This simple modification of the setup for the analysis of variance 
leads to two definite advantages. We can increase the precision 
or dependability of our results by basing the analysis on more 
scores or cases, and we can test the possible significance of the 
interaction component. Before we discuss the first advantage, 
it is necessary that we consider the question of possible inter¬ 
action, the exposition of which is facilitated by an example, which 
will also serve to illustrate the required computations. 

The computational formulas are extensions of previously used 
formulas. A $\sum X$ and a $\sum X^2$ are calculated for each cell. Summing 
the RC values of $\sum X^2$ gives $\sum\sum X_{rc}^2$ as the sum of all the mRC squared 
scores. Summing the $\sum X$ values in each row gives $\sum_c X_{rc}$, and 
summing the $\sum X$ values in each column gives $\sum_r X_{rc}$. These become 
sums along the margins, which marginal values sum down, 
and across, to the total sum of the mRC scores, $\sum\sum X_{rc}$. The 
sum of scores in any particular cell will be symbolized as $\sum X_{rc}$. 

The formulas are: 

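$$\text{Total sum of squares} = \frac{1}{mRC}\Big[mRC \sum\sum X_{rc}^2 - \big(\sum\sum X_{rc}\big)^2\Big] \qquad (103a)$$

$$\text{Between-rows squares} = \frac{1}{mRC}\Big[R \sum_r \big(\sum_c X_{rc}\big)^2 - \big(\sum\sum X_{rc}\big)^2\Big] \qquad (103b)$$

$$\text{Between-columns squares} = \frac{1}{mRC}\Big[C \sum_c \big(\sum_r X_{rc}\big)^2 - \big(\sum\sum X_{rc}\big)^2\Big] \qquad (103c)$$

$$\text{Within-cells squares} = \frac{1}{m}\Big[m \sum\sum X_{rc}^2 - \sum_r \sum_c \big(\sum X_{rc}\big)^2\Big] \qquad (103d)$$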

The interaction sum of squares is obtained as the remainder when 

the numerical values of formulas (103bcd) are subtracted from 

the total sum of squares. 
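Formulas (103a) to (103d) can be sketched in code as follows. This is only an illustrative sketch: the function names and the small 2 by 3 data set are hypothetical, and equal numbers of scores per cell are assumed. The second function evaluates the interaction term directly from its definition in Table 49, so the two routes can be compared.

```python
def two_way_ss(cells):
    """Sums of squares by formulas (103a)-(103d).

    cells[r][c] is the list of m scores in row r, column c;
    every cell is assumed to hold the same number of scores.
    """
    R, C, m = len(cells), len(cells[0]), len(cells[0][0])
    n = m * R * C
    cell_sums = [[sum(cell) for cell in row] for row in cells]
    row_sums = [sum(row) for row in cell_sums]
    col_sums = [sum(row[c] for row in cell_sums) for c in range(C)]
    grand = sum(row_sums)
    sum_sq = sum(x * x for row in cells for cell in row for x in cell)

    total = (n * sum_sq - grand ** 2) / n                        # (103a)
    rows = (R * sum(t * t for t in row_sums) - grand ** 2) / n   # (103b)
    cols = (C * sum(t * t for t in col_sums) - grand ** 2) / n   # (103c)
    cell_sq = sum(t * t for row in cell_sums for t in row)
    within = (m * sum_sq - cell_sq) / m                          # (103d)
    # Interaction is the remainder after subtracting (103b), (103c), (103d).
    return {"total": total, "rows": rows, "columns": cols,
            "within": within, "interaction": total - rows - cols - within}


def interaction_by_definition(cells):
    """m times the sum of (cell mean - row mean - column mean + grand mean)^2."""
    R, C, m = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[sum(cell) / m for cell in row] for row in cells]
    row_mean = [sum(row) / C for row in cell_mean]
    col_mean = [sum(row[c] for row in cell_mean) / R for c in range(C)]
    grand = sum(row_mean) / R
    return m * sum((cell_mean[r][c] - row_mean[r] - col_mean[c] + grand) ** 2
                   for r in range(R) for c in range(C))


# Hypothetical 2 x 3 layout with m = 2 scores per cell.
demo = [[[3, 5], [4, 6], [8, 10]],
        [[2, 2], [7, 9], [1, 3]]]
ss = two_way_ss(demo)
print(ss)
print(interaction_by_definition(demo))
```

The remainder obtained by subtraction and the value from the definition formula should agree apart from rounding in floating-point arithmetic.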


Table 50. Coded Learning Scores (Sum of Scores on 29th and 30th 
Trials) for Koerth Pursuit Rotor * 

| Rest Interval | Practice Sessions: 5 (M T W Th F) | Practice Sessions: 3 (M W F) |
|---|---|---|
| 3 minutes | 9 14 6 10 | 8 10 11 14 |
| | 10 15 10 11 | 9 7 9 10 |
| | 14 17 10 11 | 9 12 13 14 |
| | 10 7 8 15 | 12 13 7 17 |
| | 12 8 14 6 | 9 12 8 15 |
| 1 minute | 2 6 1 9 | 11 12 9 7 |
| | 5 9 2 11 | 9 6 11 9 |
| | 14 1 1 8 | 6 8 11 12 |
| | 14 4 11 5 | 9 7 4 10 |
| | 6 8 2 5 | 13 6 7 8 |

* Data from Renshaw, M. J., The effects of varied arrangements of practice 
and rest on proficiency in the acquisition of a motor skill, Unpublished Doctor's 
Dissertation, Stanford University, California, 1947. 





Table 50 contains data on learning with 2 variations as to prac¬ 
tice sessions and 2 variations as to rest interval between trials. 
For each combination of conditions there are 20 (= m) cases. 
The scores are recorded in a 2 by 2 or 4-cell table. Table 51 is a 
work-sheet layout in which are recorded sums of scores, sums of 
squared scores, and means, for cells and for the margins. The 
lower-right corner contains values for the total group of 80 cases. 


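Table 51. Sums and Means for Data of Table 50 

| Rest Interval | Practice Sessions: 5 (M T W Th F) | Practice Sessions: 3 (M W F) | Totals |
|---|---|---|---|
| 3 minutes | $\sum X_{11} = 217$; $\sum X_{11}^2 = 2543$; $\bar{X}_{11} = 10.8500$ | $\sum X_{12} = 219$; $\sum X_{12}^2 = 2547$; $\bar{X}_{12} = 10.9500$ | $\sum X_{1c} = 436$; $\sum X_{1c}^2 = 5090$; $\bar{X}_{1\cdot} = 10.9000$ |
| 1 minute | $\sum X_{21} = 124$; $\sum X_{21}^2 = 1102$; $\bar{X}_{21} = 6.2000$ | $\sum X_{22} = 175$; $\sum X_{22}^2 = 1643$; $\bar{X}_{22} = 8.7500$ | $\sum X_{2c} = 299$; $\sum X_{2c}^2 = 2745$; $\bar{X}_{2\cdot} = 7.4750$ |
| Totals | $\sum X_{r1} = 341$; $\sum X_{r1}^2 = 3645$; $\bar{X}_{\cdot 1} = 8.5250$ | $\sum X_{r2} = 394$; $\sum X_{r2}^2 = 4190$; $\bar{X}_{\cdot 2} = 9.8500$ | $\sum\sum X_{rc} = 735$; $\sum\sum X_{rc}^2 = 7835$; $\bar{X} = 9.1875$ |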


For the sums of squares (of deviations) we have the following: 

Total: $\frac{1}{80}[80(7835) - (735)^2] = 1082.1875$. 

Rows: $\frac{1}{80}[2(436^2 + 299^2) - (735)^2] = 234.6125$. 

Columns: $\frac{1}{80}[2(341^2 + 394^2) - (735)^2] = 35.1125$. 

Within cells: $\frac{1}{20}[20(7835) - (217^2 + 219^2 + 124^2 + 175^2)] = 782.4500$. 

Interaction: $1082.1875 - (234.6125 + 35.1125 + 782.4500) = 30.0125$. 
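The arithmetic above can be retraced from the raw scores of Table 50. The following is a minimal sketch; the dictionary keys and variable names are chosen here for illustration only.

```python
# Table 50 scores keyed by (rest interval, practice sessions); 20 per cell.
scores = {
    ("3 min", "5 sessions"): [9, 14, 6, 10, 10, 15, 10, 11, 14, 17, 10, 11,
                              10, 7, 8, 15, 12, 8, 14, 6],
    ("3 min", "3 sessions"): [8, 10, 11, 14, 9, 7, 9, 10, 9, 12, 13, 14,
                              12, 13, 7, 17, 9, 12, 8, 15],
    ("1 min", "5 sessions"): [2, 6, 1, 9, 5, 9, 2, 11, 14, 1, 1, 8,
                              14, 4, 11, 5, 6, 8, 2, 5],
    ("1 min", "3 sessions"): [11, 12, 9, 7, 9, 6, 11, 9, 6, 8, 11, 12,
                              9, 7, 4, 10, 13, 6, 7, 8],
}
rests = ["3 min", "1 min"]                # rows
sessions = ["5 sessions", "3 sessions"]   # columns
m = 20
n = m * len(rests) * len(sessions)

cell_sum = {key: sum(vals) for key, vals in scores.items()}
sum_sq = sum(x * x for vals in scores.values() for x in vals)
row_sum = [sum(cell_sum[(r, s)] for s in sessions) for r in rests]
col_sum = [sum(cell_sum[(r, s)] for r in rests) for s in sessions]
grand = sum(row_sum)

total = (n * sum_sq - grand ** 2) / n                                  # (103a)
rows = (len(rests) * sum(t * t for t in row_sum) - grand ** 2) / n     # (103b)
cols = (len(sessions) * sum(t * t for t in col_sum) - grand ** 2) / n  # (103c)
within = (m * sum_sq - sum(t * t for t in cell_sum.values())) / m      # (103d)
interaction = total - rows - cols - within

print(row_sum, col_sum, grand, sum_sq)   # [436, 299] [341, 394] 735 7835
for name, value in [("total", total), ("rows", rows), ("columns", cols),
                    ("within", within), ("interaction", interaction)]:
    print(name, round(value, 4))
```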

The interaction sum of squares can also be calculated by direct 
substitution into the definition formula of Table 49, which will 
involve RC quantities to be squared, summed, and multiplied by 
m. We have 

$(10.85 - 10.90 - 8.525 + 9.1875)^2 = (.6125)^2$ 

$(10.95 - 10.90 - 9.85 + 9.1875)^2 = (-.6125)^2$ 

$(6.20 - 7.475 - 8.525 + 9.1875)^2 = (-.6125)^2$ 

$(8.75 - 7.475 - 9.85 + 9.1875)^2 = (.6125)^2$ 

which when added and multiplied by 20 (the value of m) lead to 
30.0125, or the value obtained by subtraction. 

Any reader who is surprised that the above four values, involved 
in computing the interaction sum of squares directly, are numer¬ 
ically equal should ponder the fact that for the given situation 
the df for the interaction term is $(2 - 1)(2 - 1)$ or 1. 

Actually, the easiest way to compute the interaction sum of 
squares for a 2 by 2 table is to work with the 4 cell sums of scores. 
The formula is 


$$\frac{1}{4m}\big(\sum X_{11} + \sum X_{22} - \sum X_{12} - \sum X_{21}\big)^2$$

For this problem we have 

$$\frac{1}{80}(217 + 175 - 219 - 124)^2 = \frac{1}{80}(49)^2 = 30.0125$$
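As a check, the shortcut for a 2 by 2 table and the definition formula of Table 49 can be evaluated side by side from the four cell sums of Table 51. This is a sketch only; the variable names are illustrative.

```python
m = 20
# Cell sums from Table 51: rows = rest interval, columns = practice sessions.
cell_sums = [[217, 219],
             [124, 175]]

# Shortcut: (sum11 + sum22 - sum12 - sum21)^2 / (4m).
contrast = cell_sums[0][0] + cell_sums[1][1] - cell_sums[0][1] - cell_sums[1][0]
shortcut = contrast ** 2 / (4 * m)

# Definition: m * sum of (cell mean - row mean - column mean + total mean)^2.
cell_mean = [[s / m for s in row] for row in cell_sums]
row_mean = [sum(row) / (2 * m) for row in cell_sums]
col_mean = [(cell_sums[0][c] + cell_sums[1][c]) / (2 * m) for c in range(2)]
total_mean = sum(sum(row) for row in cell_sums) / (4 * m)
by_definition = m * sum(
    (cell_mean[r][c] - row_mean[r] - col_mean[c] + total_mean) ** 2
    for r in range(2) for c in range(2)
)

print(contrast, round(shortcut, 4), round(by_definition, 4))  # 49 30.0125 30.0125
```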

The sums of squares and resulting variance estimates are 
brought together in Table 52. We first test the interaction 
variance: F = 30.0125/10.2954 = 2.92, which falls short of the F 
of about 4.0 required for significance at the lenient .05 level of 
significance. 

Table 52. Analysis of Variance for Pursuit Learning 

| Source | Sum of Squares | df | Variance Estimate |
|---|---|---|---|
| Rest interval (rows) | 234.6125 | 1 | 234.6125 |
| Sessions (columns) | 35.1125 | 1 | 35.1125 |
| Interaction | 30.0125 | 1 | 30.0125 |
| Individual differences (within cells) | 782.4500 | 76 | 10.2954 |
| Total | 1082.1875 | 79 | |

This indicates that the apparent failure of the cell 
means to be consistent, in either direction, with the marginal 
means is probably due to the chance fluctuations of the cell means. 
For this particular problem the source of these chance fluctuations 
is the sampling of individuals; for the albedo-illumination data 
cited earlier, the source would be a sampling of judgments. It is 
to be noted that, if the apparent systematic fluctuation of the cell 
means, i.e., the interaction variance, is no larger than expected 
on the basis of a known source of error, whether it be individual 
differences or response errors or measurement errors, it can be 
said that, on the basis of the information obtainable from the 
data at hand, this is the only source of error for the marginal 
means. Accordingly, the between-row and between-column vari¬ 
ance estimates can be tested for significance by using as our error 
term the best estimate of the one known source of error. In the 
pursuit learning investigation, the best estimate is the within- 
cells variance. It is true that the interaction variance is, under 
the null hypothesis, also an estimate of the same trait variance, 
but it is not as good an estimate as that based on within cells 
because the latter has a larger df. 

But what of a significant interaction F? This would indicate 
that the failure of the cell means to be consistent with the marginal 
means is not due solely to a known source of error, such as sam¬ 
pling of individuals. Consequently, we would suspect that the 
data are subject to another source of variation in addition to that 
due to individual differences and that associated with the classifi- 
catory variables. Since this additional source of error could 
easily affect the marginal means, we should, in testing the signifi¬ 
cance of the row and column variance estimates, make some 
allowance therefor. In other words, $s_w^2$, or the within-cell variance 
estimate, is not an appropriate error or denominator term for F 
because it does not include the demonstrated interactive effect. 
What shall we choose as the error term? Obviously, whatever 
term we use should be an estimate which also includes the factor 
of individual differences as a source of error. The cue as to choice 
comes from the fundamental idea that $s_r^2$, $s_c^2$, $s_i^2$, and $s_w^2$ are all, 
on the basis of the null hypothesis, estimates of the same 
population variance. Now, if the interaction variance, $s_i^2$, turns out to 
be significantly larger than $s_w^2$, it follows that $s_i^2$ is an estimate 
of the population variance plus the variance due to real 
interaction. Accordingly, $s_i^2$, as an estimate which includes the two 
sources of variation, becomes an appropriate error term for the 
F ratios to be used in testing the significance of the differences 
between row and between column means. 

We thus see that having more than one case per cell permits a 
statistical check of whether the results are internally consistent, 
i.e., whether the effect of one variable is similar for subgroups 
formed on the basis of a second variable. When the interaction 
variance estimate is significant, it is the proper error term for the 
F ratio. In the situation where only one score per cell is available, 
the remainder variance estimate, which must be used as the error 
term, includes a possible though not testable interaction com¬ 
ponent. 

We may now consider the effect on pursuit learning of varying 
the rest interval and of varying the sessions. With the 
interaction not significant, $s_w^2$ is the correct error term for F. For 
sessions, we have F = 35.1125/10.2954 = 3.41, which is not 
large enough to lead us to reject the null hypothesis; but, since 
nonrejection of the null hypothesis does not prove the hypothesis, 
we can conclude only that the effect, if it exists, is not large enough 
to be demonstrated with the number of cases used. The 
between-rows or rest-interval effect is highly significant as judged by 
F = 234.6125/10.2954 = 22.79, which is double the F needed for 
the .001 level of significance. It is exceedingly doubtful whether 
this effect would have been demonstrated by using four cases, 
one per cell. 
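The testing sequence just described (test the interaction first, then choose the error term for the row and column tests) can be sketched as below. The variance estimates are those of Table 52; the critical value 4.0 is the approximate .05 figure quoted in the text, and the variable names are illustrative assumptions.

```python
# Variance estimates from Table 52.
var = {"rows": 234.6125, "columns": 35.1125,
       "interaction": 30.0125, "within": 10.2954}
F_CRIT_INTERACTION = 4.0  # "about 4.0" at the .05 level, as quoted in the text

f_interaction = var["interaction"] / var["within"]

# A significant interaction becomes the error term for rows and columns;
# otherwise the within-cells estimate (with its larger df) is used.
error_name = "interaction" if f_interaction > F_CRIT_INTERACTION else "within"
f_rows = var["rows"] / var[error_name]
f_columns = var["columns"] / var[error_name]

print(error_name, round(f_interaction, 2), round(f_rows, 2), round(f_columns, 2))
# within 2.92 22.79 3.41
```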

TRIPLE CLASSIFICATION 

Suppose that we wish to arrange an investigation so as to let 
one set of data serve to determine whether the variation of a 
dependent variable is due to or associated with variation on three 
independent variables. Again, the term independent variable is 
being used in its broad sense. It might be a “real” variable like 
illumination, temperature, amount of food, length of rest interval; 
or it might be a variable having to do with qualitative differences, 
such as kind of food, type of motivation or incentive, various 
psychological sets. It makes no difference whether the variables 
are manipulatable in the laboratory, as would be true of all those 
mentioned, or whether the desired variation is secured by appro¬ 
priate choice of cases. 





It is necessary that we be able to assign individuals or scores to 
each combination of groupings made possible by whatever 
classifications we have on the three independent variables. Let us 
suppose that there are C categories on one variable, R on another, 
and B on a third. For purposes of exposition and as a systematic 
way of arranging the data, let the C categories define C columns, 
the R categories R rows, and the B categories B blocks. Let 
$X_{rbc}$ represent the score in the rth row, bth block, and cth column, 
and let us assume for the time being that we have only one score 
for each combination. Thus $X_{324}$ would be the only score in the 
third row, second block, and fourth column. The scores may be 
arranged in some such systematic order as that in Table 53, 
which should be studied carefully by the reader. 

Note in particular how the various sums are specified and their 
location in the table. The first two subscripts in $\sum_c X_{11c}$ indicate 
that this sum has to do with scores in the first row and first block, 
and that in the summing process c takes on values running from 
1 to C. The general expression for all such sums is $\sum_c X_{rbc}$. The 
symbol $\sum_r X_{r11}$ stands for the sum of scores in the first column and 
first block; r takes on values of 1 to R. The corresponding 
general symbol is $\sum_r X_{rbc}$. In next to the bottom section of the table 
will be found $\sum_b X_{1b1}$ as the sum for all the cases in row 1 and 
column 1, the summing being through blocks; i.e., b takes on 
values from 1 to B. The general expression for such sums is 
$\sum_b X_{rbc}$. The sum of all the scores in the first block is symbolized 
as $\sum_r \sum_c X_{r1c}$, and in the bth block as $\sum_r \sum_c X_{rbc}$. For the sum of all 
the scores in the first column, irrespective of row and block, we 
have $\sum_r \sum_b X_{rb1}$; the general expression is $\sum_r \sum_b X_{rbc}$. The symbol 
$\sum_b \sum_c X_{1bc}$ stands for the sum of all scores in the first row, and $\sum_b \sum_c X_{rbc}$ 
is the corresponding general expression. Note also how the “dot” 
notation is used to specify the several means. The subscript 
which has been replaced by a dot indicates the direction of the 
addition required to obtain the sum for the given mean. Thus 
in $\bar{X}_{\cdot 24}$ the dot replaces r; this mean is based on R scores, with r 
running from 1 to R when we sum. The subscripts which are left 



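Table 53. Score and Sum Schema for Triple Classification 

| Section | Row | Column 1 | Column c | Column C | Sum | Mean |
|---|---|---|---|---|---|---|
| Block 1 | 1 | $X_{111}$ | $X_{11c}$ | $X_{11C}$ | $\sum_c X_{11c}$ | $\bar{X}_{11\cdot}$ |
| | r | $X_{r11}$ | $X_{r1c}$ | $X_{r1C}$ | $\sum_c X_{r1c}$ | $\bar{X}_{r1\cdot}$ |
| | R | $X_{R11}$ | $X_{R1c}$ | $X_{R1C}$ | $\sum_c X_{R1c}$ | $\bar{X}_{R1\cdot}$ |
| | Sum | $\sum_r X_{r11}$ | $\sum_r X_{r1c}$ | $\sum_r X_{r1C}$ | $\sum_r \sum_c X_{r1c}$ | |
| | Mean | $\bar{X}_{\cdot 11}$ | $\bar{X}_{\cdot 1c}$ | $\bar{X}_{\cdot 1C}$ | | $\bar{X}_{\cdot 1\cdot}$ (mean of block 1) |
| Block b | 1 | $X_{1b1}$ | $X_{1bc}$ | $X_{1bC}$ | $\sum_c X_{1bc}$ | $\bar{X}_{1b\cdot}$ |
| | r | $X_{rb1}$ | $X_{rbc}$ | $X_{rbC}$ | $\sum_c X_{rbc}$ | $\bar{X}_{rb\cdot}$ |
| | R | $X_{Rb1}$ | $X_{Rbc}$ | $X_{RbC}$ | $\sum_c X_{Rbc}$ | $\bar{X}_{Rb\cdot}$ |
| | Sum | $\sum_r X_{rb1}$ | $\sum_r X_{rbc}$ | $\sum_r X_{rbC}$ | $\sum_r \sum_c X_{rbc}$ | |
| | Mean | $\bar{X}_{\cdot b1}$ | $\bar{X}_{\cdot bc}$ | $\bar{X}_{\cdot bC}$ | | $\bar{X}_{\cdot b\cdot}$ (mean of block b) |
| Block B | 1 | $X_{1B1}$ | $X_{1Bc}$ | $X_{1BC}$ | $\sum_c X_{1Bc}$ | $\bar{X}_{1B\cdot}$ |
| | r | $X_{rB1}$ | $X_{rBc}$ | $X_{rBC}$ | $\sum_c X_{rBc}$ | $\bar{X}_{rB\cdot}$ |
| | R | $X_{RB1}$ | $X_{RBc}$ | $X_{RBC}$ | $\sum_c X_{RBc}$ | $\bar{X}_{RB\cdot}$ |
| | Sum | $\sum_r X_{rB1}$ | $\sum_r X_{rBc}$ | $\sum_r X_{rBC}$ | $\sum_r \sum_c X_{rBc}$ | |
| | Mean | $\bar{X}_{\cdot B1}$ | $\bar{X}_{\cdot Bc}$ | $\bar{X}_{\cdot BC}$ | | $\bar{X}_{\cdot B\cdot}$ (mean of block B) |
| Sums through blocks | 1 | $\sum_b X_{1b1}$ | $\sum_b X_{1bc}$ | $\sum_b X_{1bC}$ | $\sum_b \sum_c X_{1bc}$ | $\bar{X}_{1\cdot\cdot}$ |
| | r | $\sum_b X_{rb1}$ | $\sum_b X_{rbc}$ | $\sum_b X_{rbC}$ | $\sum_b \sum_c X_{rbc}$ | $\bar{X}_{r\cdot\cdot}$ |
| | R | $\sum_b X_{Rb1}$ | $\sum_b X_{Rbc}$ | $\sum_b X_{RbC}$ | $\sum_b \sum_c X_{Rbc}$ | $\bar{X}_{R\cdot\cdot}$ |
| | Sum | $\sum_r \sum_b X_{rb1}$ | $\sum_r \sum_b X_{rbc}$ | $\sum_r \sum_b X_{rbC}$ | $\sum_r \sum_b \sum_c X_{rbc}$ | $\bar{X}_{\cdot\cdot\cdot}$ |
| Means for rows by columns | 1 | $\bar{X}_{1\cdot 1}$ | $\bar{X}_{1\cdot c}$ | $\bar{X}_{1\cdot C}$ | | $\bar{X}_{1\cdot\cdot}$ (means for rows) |
| | r | $\bar{X}_{r\cdot 1}$ | $\bar{X}_{r\cdot c}$ | $\bar{X}_{r\cdot C}$ | | $\bar{X}_{r\cdot\cdot}$ |
| | R | $\bar{X}_{R\cdot 1}$ | $\bar{X}_{R\cdot c}$ | $\bar{X}_{R\cdot C}$ | | $\bar{X}_{R\cdot\cdot}$ |
| Column means | | $\bar{X}_{\cdot\cdot 1}$ | $\bar{X}_{\cdot\cdot c}$ | $\bar{X}_{\cdot\cdot C}$ | | $\bar{X}_{\cdot\cdot\cdot} = \bar{X}$ |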


denote that the mean is for scores in the second block and fourth 
column. The total number of means will be as follows: 

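RB means of the form $\bar{X}_{rb\cdot}$ 
RC means of the form $\bar{X}_{r\cdot c}$ 
BC means of the form $\bar{X}_{\cdot bc}$ 
R means of the form $\bar{X}_{r\cdot\cdot}$ 
B means of the form $\bar{X}_{\cdot b\cdot}$ 
C means of the form $\bar{X}_{\cdot\cdot c}$ 
One mean of the form $\bar{X}_{\cdot\cdot\cdot}$ = total mean = $\bar{X}$ 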

Perhaps a better appreciation of the meaning of all these means 
can be obtained by a study of Fig. 18, which pictures geometrically 
the situation for 2 blocks, 3 rows, and 4 columns. The individual 
scores can be thought of as in the cubicles of a 2 by 3 by 4 box. 
Summing through the box in the vertical direction leads to the 



[Fig. 18. Geometric picture of triple classification.] 

8 means on the top; summing in the forward-backward direction 
leads to the 12 means on the front surface; and summing through 
right-leftward leads to the 6 means on the side. Summing the 
means (or summing sums) across the front leads to the means 
placed along the vertical axis for the groups defined by the rows; 
summing the means (or sums) downward on the front leads to 
the means placed along the right-left axis for the groups defined 
by the columns; summing down on the side leads to the means, 
along the third axis, for the groups defined by the blocks. To 
get any of these means it is, of course, assumed that the sum 
involved is divided by the proper number. 
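The counts of surface and edge means for the 2 by 3 by 4 box can be illustrated with a short sketch. The scores are hypothetical filler values, and the names `top`, `front`, and `side` simply follow the wording used for Fig. 18; which physical direction corresponds to which subscript is an assumption read off from the counts 8, 12, and 6.

```python
R, B, C = 3, 2, 4  # 3 rows, 2 blocks, 4 columns, as in Fig. 18

# Hypothetical scores indexed X[r][b][c] (row, block, column).
X = [[[(7 * r + 3 * b + 5 * c) % 11 for c in range(C)]
      for b in range(B)] for r in range(R)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Surface means: sum out one subscript at a time.
top = {(b, c): mean(X[r][b][c] for r in range(R))
       for b in range(B) for c in range(C)}      # block-by-column means
front = {(r, c): mean(X[r][b][c] for b in range(B))
         for r in range(R) for c in range(C)}    # row-by-column means
side = {(r, b): mean(X[r][b][c] for c in range(C))
        for r in range(R) for b in range(B)}     # row-by-block means

# Edge means: sum out a second subscript.
row_means = [mean(front[(r, c)] for c in range(C)) for r in range(R)]
col_means = [mean(front[(r, c)] for r in range(R)) for c in range(C)]
block_means = [mean(side[(r, b)] for r in range(R)) for b in range(B)]
total_mean = mean(x for plane in X for line in plane for x in line)

print(len(top), len(front), len(side))  # 8 12 6
print(len(row_means), len(block_means), len(col_means))  # 3 2 4
```

Because every cubicle holds exactly one score, the mean of each set of edge means equals the total mean.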

Of primary interest is the question: Is the variation among the 
means along the edges, considered separately, larger than expected 
on the basis of chance? To answer this we need to break down the 
sum of squares of deviations from the total mean into appropriate 
components. The score $X_{rbc}$ in the cubicle defined by the rth 
row, bth block, and cth column will vary more or less from $\bar{X}$, 
and three possible sources of variation for $X_{rbc}$ are obvious: the 
deviation of its row mean, its column mean, and its block mean 
from $\bar{X}$. Now, if we recall the situation for double classification, 
it is fairly obvious that, when the score $X_{rbc}$ is considered as 
belonging in row r and column c, one source of variation becomes 
the remainder or interaction for rows and columns; considered 
next as also falling in row r and block b, another source of 
variation is the possible interaction of rows and blocks; and then 
thought of as belonging to column c and block b, the score also 
involves the interaction of columns and blocks. 

When the sums of squares for these six components are added, 
it will be discovered that they do not sum to the sum of squares 
for the total; i.e., subtracting these six sums from the total sum 
leaves a remainder. This residual is sometimes referred to as 
error, more frequently as a triple interaction. This term involves 
rows, blocks, and columns. The reader, having in mind the idea 
that the simple row by column interaction has to do with the 
possible failure of cell entries to be consistent with the two sets of 
marginal means, must now try imagining that the RBC entries 
in the cubical cells of our box may not be entirely consistent with 
the three sets of means on the edges and with the three sets on 
the surface. We have seen that a statistical check on simple 
interaction is not possible with only one entry per cell; similarly 
more than one score per cubicle is required for testing triple 
interaction. 

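Table 54 gives the essentials, in symbols, for the analysis of 
variance for the triple classification setup. In order to specify 
the interactions, we here adopt the abbreviation scheme 
generally used. Thus R × B, read R by B, indicates the row and 
block interaction, and R × B × C stands for the row by block 
by column or triple interaction. In a given investigation, the 
rows, blocks, and columns refer to particular independent or 
classificatory variables. 

Table 54. Variance Table for Triple Classification into R Rows, 
B Blocks, and C Columns 

| Source | Sum of Squares | df | Variance Estimate |
|---|---|---|---|
| Rows | $BC \sum_r (\bar{X}_{r\cdot\cdot} - \bar{X})^2$ | $R - 1$ | $s_r^2$ |
| Blocks | $RC \sum_b (\bar{X}_{\cdot b\cdot} - \bar{X})^2$ | $B - 1$ | $s_b^2$ |
| Columns | $RB \sum_c (\bar{X}_{\cdot\cdot c} - \bar{X})^2$ | $C - 1$ | $s_c^2$ |
| R × B interaction | $C \sum_r \sum_b (\bar{X}_{rb\cdot} - \bar{X}_{r\cdot\cdot} - \bar{X}_{\cdot b\cdot} + \bar{X})^2$ | $(R - 1)(B - 1)$ | $s_{rb}^2$ |
| R × C interaction | $B \sum_r \sum_c (\bar{X}_{r\cdot c} - \bar{X}_{r\cdot\cdot} - \bar{X}_{\cdot\cdot c} + \bar{X})^2$ | $(R - 1)(C - 1)$ | $s_{rc}^2$ |
| B × C interaction | $R \sum_b \sum_c (\bar{X}_{\cdot bc} - \bar{X}_{\cdot b\cdot} - \bar{X}_{\cdot\cdot c} + \bar{X})^2$ | $(B - 1)(C - 1)$ | $s_{bc}^2$ |
| R × B × C or triple interaction | $\sum_r \sum_b \sum_c (X_{rbc} - \bar{X}_{rb\cdot} - \bar{X}_{r\cdot c} - \bar{X}_{\cdot bc} + \bar{X}_{r\cdot\cdot} + \bar{X}_{\cdot b\cdot} + \bar{X}_{\cdot\cdot c} - \bar{X})^2$ | $(R - 1)(B - 1)(C - 1)$ | $s_{rbc}^2$ |
| Total | $\sum_r \sum_b \sum_c (X_{rbc} - \bar{X})^2$ | $RBC - 1$ | |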

It will be noted in Table 54 that the df for the triple interaction 
term is given as $(R - 1)(B - 1)(C - 1)$. The student may be 
helped in understanding the reasoning which leads to this df by 
referring again to Fig. 18. The surface means tend to restrict 
the deviation score values within the box. How many cubical 
cells can we fill before these restrictions operate? The general 









rule-of-thumb procedure for determining the df for interaction 
sums of squares is to take the product of the dfs of the variables 
involved in the given interaction. This holds for simple, triple, 
and higher-order interactions. 
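The rule of thumb lends itself to a quick bookkeeping check: the df's of the seven components should add to RBC − 1. The sketch below uses a hypothetical helper name and, as an arbitrary example, 2 rows, 4 blocks, and 3 columns.

```python
def triple_classification_df(R, B, C):
    """Degrees of freedom for each source, one score per cubicle."""
    r, b, c = R - 1, B - 1, C - 1
    return {
        "rows": r,
        "blocks": b,
        "columns": c,
        "R x B": r * b,        # product of the df's of the variables involved
        "R x C": r * c,
        "B x C": b * c,
        "R x B x C": r * b * c,
    }


df = triple_classification_df(R=2, B=4, C=3)
print(df)
print(sum(df.values()), 2 * 4 * 3 - 1)  # 23 23
```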


SPECIAL CASE WHERE THE BLOCKS STAND FOR 
PERSONS OR MATCHED INDIVIDUALS 

Suppose the purpose of a study is to ascertain whether varia¬ 
tion on a dependent variable is influenced by or associated with 
variation on two independent variables. This, of course, involves 
the double classification idea previously discussed, but we are 
now in a position to accomplish, by means of triple classification, 
two closely related things which could not be done by the simpler 
double classification scheme. 

1. If transfer, practice, fatigue, etc., effects are such that it is 
permissible to make observations on an individual under each of 
the RC combinations of conditions, we may increase the precision 
of an experiment by using only m individuals instead of mRC 
individuals as in the illustration involving pursuit learning. Or 
we may make observations on mRC cases so as to have in each of 
the RC cells m scores which are based on m sets of matched indi¬ 
viduals, thereby reducing error. 

2. If we are dealing with a situation in which it is required that 
observations be made on the same individual in each of the RC 
conditions, and if more than one case is used either to reduce 
errors or to provide a basis for generalizing to a population, it is 
necessary that we make statistical allowance for the fact that the 
RC observations on the m cases are nonindependent, or correlated. 
This allowance was not possible by the double classification 
scheme, for which it was assumed that the m scores in one cell 
were independent of the observations in the other cells. 

It will be recalled that in the double classification setup, by 
letting one classification refer to R individuals or sets of matched 
cases, we were provided with an over-all test of significance for 
several correlated means for groups classified on a single inde¬ 
pendent variable. Triple classification permits a similar test of 
correlated means for groups involved in double classification. We 
may let the B blocks stand for B individuals or sets of matched 
persons. The triple interaction sum of squares in Table 54 is a 




remainder after the sums for rows, for columns, for blocks, and 
for the three simple interactions have been subtracted from the 
total sum of squares. In effect, this remainder represents the 
variation which is left over after an adjustment has been made 
for row, column, and block variations and for the three inter¬ 
actions. 

In order to understand better the meaning of the triple inter¬ 
action when the B blocks correspond to B individuals, we will 
consider the case in which we have 2 rows and 2 columns. In 
order further to simplify the exposition, we will assume that the 
2 row means are equal and that the 2 column means are also equal, 
or that the means along the vertical and bottom edges (Fig. 18) 
are equal to $\bar{X}$, the total mean. We will further assume that there 
is no interaction between rows and columns, i.e., that each mean 
on the front surface equals $\bar{X}$. By referring either to Fig. 18 or 
to Table 54, we see that these assumptions are equivalent to 
saying that $\bar{X}_{r\cdot\cdot} = \bar{X}_{\cdot\cdot c} = \bar{X}$, and $\bar{X}_{r\cdot c} = \bar{X}$. 

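When these identities are substituted in the expression for the 
triple interaction sum of squares, its value becomes 

$$\sum_r \sum_b \sum_c (X_{rbc} - \bar{X}_{rb\cdot} - \bar{X}_{\cdot bc} + \bar{X}_{\cdot b\cdot})^2 \qquad (104a)$$

Now, when we let both r and c take on the explicit values of 1 
and 2, this sum can be broken down into 4 parts, a part for each 
of the 4 scores for an individual: 

$$\sum_b (X_{1b1} - \bar{X}_{1b\cdot} - \bar{X}_{\cdot b1} + \bar{X}_{\cdot b\cdot})^2 + \sum_b (X_{1b2} - \bar{X}_{1b\cdot} - \bar{X}_{\cdot b2} + \bar{X}_{\cdot b\cdot})^2$$

$$+ \sum_b (X_{2b1} - \bar{X}_{2b\cdot} - \bar{X}_{\cdot b1} + \bar{X}_{\cdot b\cdot})^2 + \sum_b (X_{2b2} - \bar{X}_{2b\cdot} - \bar{X}_{\cdot b2} + \bar{X}_{\cdot b\cdot})^2 \qquad (104b)$$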

Let us now look at the terms in the first of these sums. Each 
person has 4 scores, 2 assigned to each row and 2 to each column. 
Now $\bar{X}_{1b\cdot}$ is the mean of the 2 scores in the first row; $\bar{X}_{\cdot b1}$ is the 
mean of the 2 scores in the first column; and $\bar{X}_{\cdot b\cdot}$ is the mean of 
all 4 scores for individual b. We may therefore write the first 
sum of (104b) as 

$$\sum_b \Big(X_{1b1} - \frac{X_{1b1} + X_{1b2}}{2} - \frac{X_{1b1} + X_{2b1}}{2} + \frac{X_{1b1} + X_{1b2} + X_{2b1} + X_{2b2}}{4}\Big)^2$$

which becomes 

$$\frac{1}{16} \sum_b (4X_{1b1} - 2X_{1b1} - 2X_{1b2} - 2X_{1b1} - 2X_{2b1} + X_{1b1} + X_{1b2} + X_{2b1} + X_{2b2})^2$$

which simplifies to 

$$\frac{1}{16} \sum_b [(X_{1b1} - X_{1b2}) - (X_{2b1} - X_{2b2})]^2$$

With a similar replacement of means by appropriate scores, it 
can be shown that the other 3 parts or sums will have the same 
value as that derived for the first except for the sign within the 
bracket, which is immaterial since the net value is squared. Hence, 
when we add the 4 parts, we get 

$$\frac{1}{4} \sum_b [(X_{1b1} - X_{1b2}) - (X_{2b1} - X_{2b2})]^2 \qquad (105)$$

as the remainder or triple interaction sum of squares for the 
special case of 2 rows and 2 columns with row, column, and row by 
column interaction sums of squares assumed equal to zero. Note 
that the result involves the difference between the 2 differences 
of paired scores, the 2 scores in the first row, and the 2 scores in 
the second row. The scores within the bracket might also be 
grouped as $(X_{1b1} - X_{2b1}) - (X_{1b2} - X_{2b2})$, or the difference 
between the differences of column values. 

It will be recalled (p. 275) that, for the double classification 
scheme with 2 columns and with the rows as R individuals, the 
remainder sum of squares, involving the squares of the differences 
between paired scores, was an error term similar to that obtained 
by use of the standard error of the difference between correlated 
means. We have an analogous expression in the remainder (triple 
interaction) term for triple classification with the blocks corre¬ 
sponding to B individuals and with just 2 rows and 2 columns. 





In other words, this remainder term, when divided by its df, 
becomes an error variance for testing the significance between 
correlated means in the more complicated situation where an 
observation is made on each individual under 2 by 2 or 4 combina¬ 
tions of conditions, and for testing significance when B sets of 
matched individuals (4 cases in each set) are used. 

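Table 55. Symbolic and Numerical Scores for a First Individual 
Observed under Four Combinations of Conditions 

| Row | Symbolic: Column 1 | Symbolic: Column 2 | Symbolic: Mean | Numerical: Column 1 | Numerical: Column 2 | Numerical: Mean |
|---|---|---|---|---|---|---|
| 1 | $X_{111}$ | $X_{112}$ | $\bar{X}_{11\cdot}$ | 8 | 22 | 15 |
| 2 | $X_{211}$ | $X_{212}$ | $\bar{X}_{21\cdot}$ | 16 | 6 | 11 |
| Mean | $\bar{X}_{\cdot 11}$ | $\bar{X}_{\cdot 12}$ | $\bar{X}_{\cdot 1\cdot}$ | 12 | 14 | 13 |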


Perhaps a numerical example of the difference between differ¬ 
ences called for in formula (105) will be helpful. Let us take four 
scores for a first individual. Table 55 contains the symbolic and 
numerical scores along with the necessary means. Substituting 
these numerical values in formula (104b) gives 

$$(8 - 15 - 12 + 13)^2 + (22 - 15 - 14 + 13)^2 + (16 - 11 - 12 + 13)^2 + (6 - 11 - 14 + 13)^2$$

$$= (-6)^2 + (+6)^2 + (+6)^2 + (-6)^2 = 144$$

Substituting in (105), we have 

$$\tfrac{1}{4}[(8 - 22) - (16 - 6)]^2 = \tfrac{1}{4}[(-14) - (10)]^2 = \tfrac{1}{4}(576) = 144$$
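The same check can be written out in a few lines of code, using the four scores of Table 55. The dictionary layout is only an illustrative choice.

```python
# Scores of the first individual, keyed by (row, column), from Table 55.
x = {(1, 1): 8, (1, 2): 22, (2, 1): 16, (2, 2): 6}

row_mean = {r: (x[(r, 1)] + x[(r, 2)]) / 2 for r in (1, 2)}   # 15, 11
col_mean = {c: (x[(1, c)] + x[(2, c)]) / 2 for c in (1, 2)}   # 12, 14
person_mean = sum(x.values()) / 4                             # 13

# Formula (104b): four squared deviations for this one individual.
via_104b = sum((x[(r, c)] - row_mean[r] - col_mean[c] + person_mean) ** 2
               for r in (1, 2) for c in (1, 2))

# Formula (105): one fourth of the squared difference between differences.
via_105 = ((x[(1, 1)] - x[(1, 2)]) - (x[(2, 1)] - x[(2, 2)])) ** 2 / 4

print(via_104b, via_105)  # 144.0 144.0
```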

Since the foregoing reduction of the triple interaction sum of 
squares to a function of the difference between paired scores was 
accomplished by certain simplifying assumptions (row and column 
means equal and no row by column interaction), the reader may 
raise the question of whether the result is confined only to situa¬ 
tions where these assumptions hold. The answer is that the basic 
idea always holds. Actually, the general expression for the triple 
interaction sum of squares given in Table 54 incorporates adjust¬ 
ments for possible differences between row means and between 
column means and for possible interaction between rows and 
columns. Furthermore, the deduction that the triple interaction 
sum of squares is a function of the difference between an 
individual's scores is true also in the situation where we have 
observations classifiable into more than 2 rows and more than 2 columns. 
If, instead of the B blocks corresponding to B individuals, we 
have B sets of matched individuals, the triple interaction term 
becomes a function of the differences among the scores of the RC 
individuals within a set. All this is quite complicated, but it is 
hoped that from our discussion the student will have gained some 
insight concerning the possible use of the analysis of variance for 
testing the significance between means of correlated observations 
obtained by securing a measurement for each of B individuals 
under the RC combinations of conditions, defined by variations 
in the characteristics used to designate the rows and columns. 
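For the 2 by 2 case, one way to see that the basic idea survives without the simplifying assumptions is to compare the general triple interaction sum of squares of Table 54 with one fourth of the sum of squared deviations of each individual's difference between differences from the mean of those differences. That centered form is an inference added here for illustration, not a formula stated in the text, and the scores below are hypothetical.

```python
# Hypothetical scores X[r][b][c]: 2 rows, 5 individuals (blocks), 2 columns.
X = [
    [[8, 22], [10, 15], [7, 19], [12, 14], [9, 20]],
    [[16, 6], [11, 13], [14, 9], [10, 12], [15, 8]],
]
R, B, C = 2, len(X[0]), 2


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Surface and edge means, named by the subscripts that remain.
rb = [[mean(X[r][b]) for b in range(B)] for r in range(R)]
rc = [[mean(X[r][b][c] for b in range(B)) for c in range(C)] for r in range(R)]
bc = [[mean(X[r][b][c] for r in range(R)) for c in range(C)] for b in range(B)]
row_m = [mean(rb[r]) for r in range(R)]
blk_m = [mean(bc[b]) for b in range(B)]
col_m = [mean(rc[r][c] for r in range(R)) for c in range(C)]
total_m = mean(row_m)

# General triple interaction sum of squares (Table 54).
triple = sum(
    (X[r][b][c] - rb[r][b] - rc[r][c] - bc[b][c]
     + row_m[r] + blk_m[b] + col_m[c] - total_m) ** 2
    for r in range(R) for b in range(B) for c in range(C)
)

# Difference between differences for each individual, centered on its mean.
d = [(X[0][b][0] - X[0][b][1]) - (X[1][b][0] - X[1][b][1]) for b in range(B)]
d_bar = mean(d)
from_differences = sum((v - d_bar) ** 2 for v in d) / 4

print(round(triple, 6), round(from_differences, 6))
```

When the row by column interaction is zero the mean difference between differences is zero, and the centered form reduces to formula (105).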


COMPUTATIONAL ILLUSTRATION FOR TRIPLE CLASSIFICATION 

The task of computing the required sums of squares (see Table 
54) is tedious. The first step is to arrange the data in some such 
systematic order as that depicted in Table 53 and do the neces¬ 
sary adding to secure the various sums indicated in that table. 
The total sum of squares for all RBC cases is obtained as usual: 
sum all the scores, sum all the squared scores, and substitute in 
the general formula $(1/RBC)[RBC \sum X^2 - (\sum X)^2]$. 

To secure the three between-groups and the three simple inter¬ 
action sums of squares, we form three subtables involving sums 
taken in various directions. For the first of these subtables we 
take row by column sums obtained by adding cell entries from 
block to block, i.e., through the B blocks. The next to the bottom 
section of Table 53 contains these row by column sums, which 
we reproduce here as Table 56a. The reader will note that the 
values for Table 56b are the right-hand margin sums of Table 53, 
and that the values for Table 56c are found as the sums in Table 
53 along the bottom of each block. 

With these auxiliary tables in mind, we can write the required 
computational formulas. The simple interaction terms are secured 
by computing a subtotal sum of squares for each table and then 
subtracting therefrom the two appropriate “between” sums of 
squares. These subtotal sums of squares will not be the same as 
the total sum of squares obtained for double classification by 
formula (102a) because we are now dealing with cell entries which 
are the sums of scores rather than single scores. Due allowance 
for this can be made by a slight change in formula (102a). The 
amended formula, with notation appropriate for and specific to 



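Table 56a. Required Sums for Row by Column Analysis 

| Row | Column 1 | Column c | Column C | Sum |
|---|---|---|---|---|
| 1 | $\sum_b X_{1b1}$ | $\sum_b X_{1bc}$ | $\sum_b X_{1bC}$ | $\sum_b \sum_c X_{1bc}$ |
| r | $\sum_b X_{rb1}$ | $\sum_b X_{rbc}$ | $\sum_b X_{rbC}$ | $\sum_b \sum_c X_{rbc}$ |
| R | $\sum_b X_{Rb1}$ | $\sum_b X_{Rbc}$ | $\sum_b X_{RbC}$ | $\sum_b \sum_c X_{Rbc}$ |
| Sum | $\sum_r \sum_b X_{rb1}$ | $\sum_r \sum_b X_{rbc}$ | $\sum_r \sum_b X_{rbC}$ | $\sum_r \sum_b \sum_c X_{rbc}$ |

Table 56b. Required Sums for Row by Block Analysis 

| Row | Block 1 | Block b | Block B | Sum |
|---|---|---|---|---|
| 1 | $\sum_c X_{11c}$ | $\sum_c X_{1bc}$ | $\sum_c X_{1Bc}$ | $\sum_b \sum_c X_{1bc}$ |
| r | $\sum_c X_{r1c}$ | $\sum_c X_{rbc}$ | $\sum_c X_{rBc}$ | $\sum_b \sum_c X_{rbc}$ |
| R | $\sum_c X_{R1c}$ | $\sum_c X_{Rbc}$ | $\sum_c X_{RBc}$ | $\sum_b \sum_c X_{Rbc}$ |
| Sum | $\sum_r \sum_c X_{r1c}$ | $\sum_r \sum_c X_{rbc}$ | $\sum_r \sum_c X_{rBc}$ | $\sum_r \sum_b \sum_c X_{rbc}$ |

Table 56c. Required Sums for Block by Column Analysis 

| Block | Column 1 | Column c | Column C | Sum |
|---|---|---|---|---|
| 1 | $\sum_r X_{r11}$ | $\sum_r X_{r1c}$ | $\sum_r X_{r1C}$ | $\sum_r \sum_c X_{r1c}$ |
| b | $\sum_r X_{rb1}$ | $\sum_r X_{rbc}$ | $\sum_r X_{rbC}$ | $\sum_r \sum_c X_{rbc}$ |
| B | $\sum_r X_{rB1}$ | $\sum_r X_{rBc}$ | $\sum_r X_{rBC}$ | $\sum_r \sum_c X_{rBc}$ |
| Sum | $\sum_r \sum_b X_{rb1}$ | $\sum_r \sum_b X_{rbc}$ | $\sum_r \sum_b X_{rbC}$ | $\sum_r \sum_b \sum_c X_{rbc}$ |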


the three auxiliary tables, may be written as follows: 
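Subtotal: row by column

$$\frac{1}{RBC}\left[RC\sum_r\sum_c\Big(\sum_b X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106a)$$

Subtotal: row by block

$$\frac{1}{RBC}\left[RB\sum_r\sum_b\Big(\sum_c X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106b)$$

Subtotal: block by column

$$\frac{1}{RBC}\left[BC\sum_b\sum_c\Big(\sum_r X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106c)$$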

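From the right-hand margin of either Table 56a or 56b we can
compute the sum of squares for

Between rows:

$$\frac{1}{RBC}\left[R\sum_r\Big(\sum_b\sum_c X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106d)$$

From the bottom of either Table 56a or 56c we can obtain the
sum of squares for

Between columns:

$$\frac{1}{RBC}\left[C\sum_c\Big(\sum_r\sum_b X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106e)$$

From the bottom of Table 56b or from the right-hand margin of
56c we can calculate the sum of squares for

Between blocks:

$$\frac{1}{RBC}\left[B\sum_b\Big(\sum_r\sum_c X_{rbc}\Big)^2 - \Big(\sum_r\sum_b\sum_c X_{rbc}\Big)^2\right] \qquad (106f)$$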

Then from the above six sums of squares the simple interaction
sums of squares may be secured by the following subtractions:

Row by column interaction: (106a) - (106d) - (106e) (107a)

Row by block interaction: (106b) - (106d) - (106f) (107b)

Block by column interaction: (106c) - (106e) - (106f) (107c)

And finally, again by subtraction, we have the sum of squares
for the row by column by block, or

Triple interaction: Total sum of squares minus (106def)
minus (107abc).
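The subtraction scheme of formulas (106) and (107) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original computing routine: the function name `triple_classification_ss`, the nested-list layout `x[r][b][c]`, and the small additive data set are all assumptions made for the example.

```python
def triple_classification_ss(x):
    """Sums of squares for scores x[r][b][c], one score per cubicle.

    Hypothetical helper: r indexes rows, b blocks, c columns.
    """
    R, B, C = len(x), len(x[0]), len(x[0][0])
    n = R * B * C
    idx = [(r, b, c) for r in range(R) for b in range(B) for c in range(C)]
    grand = sum(x[r][b][c] for r, b, c in idx)

    def sums_by(key):
        # Build an auxiliary table of sums, one entry per distinct key.
        table = {}
        for r, b, c in idx:
            k = key(r, b, c)
            table[k] = table.get(k, 0) + x[r][b][c]
        return list(table.values())

    def ss(multiplier, sums):
        # (1/RBC) [multiplier * (sum of squared sums) - (grand sum)^2]
        return (multiplier * sum(s * s for s in sums) - grand ** 2) / n

    total = ss(n, [x[r][b][c] for r, b, c in idx])
    rows = ss(R, sums_by(lambda r, b, c: r))              # (106d)
    cols = ss(C, sums_by(lambda r, b, c: c))              # (106e)
    blocks = ss(B, sums_by(lambda r, b, c: b))            # (106f)
    sub_rc = ss(R * C, sums_by(lambda r, b, c: (r, c)))   # (106a)
    sub_rb = ss(R * B, sums_by(lambda r, b, c: (r, b)))   # (106b)
    sub_bc = ss(B * C, sums_by(lambda r, b, c: (b, c)))   # (106c)

    out = {
        "total": total, "rows": rows, "columns": cols, "blocks": blocks,
        "row x column": sub_rc - rows - cols,        # (107a)
        "row x block": sub_rb - rows - blocks,       # (107b)
        "block x column": sub_bc - cols - blocks,    # (107c)
    }
    # Triple interaction: total minus the three "between" terms
    # minus the three simple interactions.
    out["triple"] = total - sum(v for k, v in out.items() if k != "total")
    return out


# Made-up, purely additive scores (2 rows, 3 blocks, 2 columns):
# every interaction sum of squares should then be zero.
demo = [[[r + 2 * b + 3 * c for c in range(2)] for b in range(3)] for r in range(2)]
result = triple_classification_ss(demo)
for source, value in result.items():
    print(f"{source:15s} {value:8.2f}")
```

Because the made-up scores contain no interaction, the three simple interaction terms and the triple term come out as zero, which gives a quick check that the subtractions have been set up correctly.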





Table 57. Data Used in Illustrating Computations for Triple Classification:
2 Levels of Illumination (Rows), 3 Albedos (Columns),
and 4 Observers (Blocks)

                                      Albedo
Observer        Illumination     .07      .14      .26      Sum     Mean
1               1.20              11       24       60       95    31.67
                2.00              14       24       65      103    34.33
                Sum               25       48      125      198    33.00
                Mean            12.5     24.0     62.5             33.00
2               1.20              22       26       44       92    30.67
                2.00              27       36       47      110    36.67
                Sum               49       62       91      202    33.67
                Mean            24.5     31.0     45.5             33.67
3               1.20              16       22       55       93    31.00
                2.00              18       24       62      104    34.67
                Sum               34       46      117      197    32.83
                Mean            17.0     23.0     58.5             32.83
4               1.20              20       32       82      134    44.67
                2.00              24       59       84      167    55.67
                Sum               44       91      166      301    50.17
                Mean            22.0     45.5     83.0             50.17
Sums through    1.20              69      104      241      414    34.50
blocks          2.00              83      143      258      484    40.33
                Sum              152      247      499      898    37.42
Means for rows  1.20           17.25    26.00    60.25             34.50
by columns      2.00           20.75    35.75    64.50             40.33
Column means                   19.00    30.87    62.38             37.42

Data from R. E. Taubman, J. Exp. Psychol., 1945, 35, 235-241.

We will illustrate the procedure by using the data of Table 57, 
in which the rows represent 2 levels of illumination, the columns 3 
degrees of albedo, and the blocks 4 individuals, and the scores 
are judged whiteness. Notice that each subject made judgments 
under all 6 of the combinations of conditions. The sums given 
in Table 57 become the entries for the auxiliary computational
Tables 58abc. The needed value of $\sum_r\sum_b\sum_c X_{rbc}$ is 898, and the sum
of all the squared scores, $\sum_r\sum_b\sum_c X_{rbc}^2$, is 44,394. From these figures
we have

$$\frac{1}{24}\left[24(44,394) - (898)^2\right] = 10,793.83 = \text{total sum of squares}$$

The various "between" sums can readily be obtained by adding
the squares of the appropriate marginal sums of auxiliary Tables
58abc, and substituting in formulas (106def).

Table 58a. Required Sums for Row by Column Analysis

                        Albedo
Illumination     .07     .14     .26     Sum
1.20              69     104     241     414
2.00              83     143     258     484
Sum              152     247     499     898


Table 58b. Required Sums for Row by Block Analysis

                         Individuals
Illumination       1       2       3       4     Sum
1.20              95      92      93     134     414
2.00             103     110     104     167     484
Sum              198     202     197     301     898

Table 58c. Required Sums for Block by Column Analysis

                        Albedo
Individual       .07     .14     .26     Sum
1                 25      48     125     198
2                 49      62      91     202
3                 34      46     117     197
4                 44      91     166     301
Sum              152     247     499     898


For between rows we need $(414)^2 + (484)^2 = 405,652$;

For between columns we need $(152)^2 + (247)^2 + (499)^2 = 333,114$;

For between blocks we need $(198)^2 + (202)^2 + (197)^2 + (301)^2 = 209,418$.

Then we have

$$\frac{1}{24}\left[2(405,652) - (898)^2\right] = 204.17 \text{ for between-rows sum of squares}$$

$$\frac{1}{24}\left[3(333,114) - (898)^2\right] = 8039.08 \text{ for between-columns sum of squares}$$

$$\frac{1}{24}\left[4(209,418) - (898)^2\right] = 1302.83 \text{ for between-blocks sum of squares}$$

In order to secure the subtotal sums of squares we add the
squares of the cell entries in the auxiliary tables. For the row
by column subtotal we have from Table 58a:

$$(69)^2 + (83)^2 + (104)^2 + (143)^2 + (241)^2 + (258)^2 = 167,560$$

Similarly for the row by block subtotal we have from Table 58b:

$$(95)^2 + (103)^2 + \cdots + (167)^2 = 105,508$$

and for the block by column subtotal we have from Table 58c:

$$(25)^2 + \cdots + (44)^2 + \cdots + (166)^2 = 87,814$$




These three sums can now be substituted into formulas (106abc):

$$\frac{1}{24}\left[6(167,560) - (898)^2\right] = 8289.83 = \text{row by column subtotal sum of squares}$$

$$\frac{1}{24}\left[8(105,508) - (898)^2\right] = 1569.17 = \text{row by block subtotal sum of squares}$$

$$\frac{1}{24}\left[12(87,814) - (898)^2\right] = 10,306.83 = \text{block by column subtotal sum of squares}$$

Next we get the simple interaction sum of squares by the subtractions
indicated in formulas (107abc):

8289.83 - 204.17 - 8039.08 = 46.58 = row by column interaction

1569.17 - 204.17 - 1302.83 = 62.17 = row by block interaction

10,306.83 - 8039.08 - 1302.83 = 964.92 = block by column interaction

Then for the triple interaction sum of squares we have

10,793.83 - 204.17 - 8039.08 - 1302.83
    - 46.58 - 62.17 - 964.92 = 174.08
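The arithmetic of this illustration can be checked with a short Python sketch that starts from the auxiliary Tables 58a, 58b, and 58c, just as the hand computation does. The variable and helper names below are assumptions introduced for the example; only the numbers come from the text.

```python
# Quantities stated in the text: RBC = 24, grand sum = 898,
# sum of all squared scores = 44,394.
N, GRAND, SUM_SQ = 24, 898, 44394

# Auxiliary tables of sums (Tables 58a, 58b, 58c).
table_58a = [[69, 104, 241], [83, 143, 258]]                # illumination x albedo
table_58b = [[95, 92, 93, 134], [103, 110, 104, 167]]       # illumination x individual
table_58c = [[25, 48, 125], [49, 62, 91],
             [34, 46, 117], [44, 91, 166]]                  # individual x albedo


def ss(multiplier, sums):
    """(1/24)[multiplier * sum of squared sums - (898)^2], as in formulas (106)."""
    return (multiplier * sum(s * s for s in sums) - GRAND ** 2) / N


def cells(table):
    return [value for row in table for value in row]


def row_sums(table):
    return [sum(row) for row in table]


def col_sums(table):
    return [sum(col) for col in zip(*table)]


total = (N * SUM_SQ - GRAND ** 2) / N
rows = ss(2, row_sums(table_58a))       # between rows (illumination)
cols = ss(3, col_sums(table_58a))       # between columns (albedo)
blocks = ss(4, col_sums(table_58b))     # between blocks (individuals)
rc = ss(6, cells(table_58a)) - rows - cols       # (107a)
rb = ss(8, cells(table_58b)) - rows - blocks     # (107b)
bc = ss(12, cells(table_58c)) - cols - blocks    # (107c)
triple = total - rows - cols - blocks - rc - rb - bc

# Each sum of squares paired with its degrees of freedom.
summary = {
    "Illumination": (rows, 1), "Albedo": (cols, 2), "Subjects": (blocks, 3),
    "I X A": (rc, 2), "I X S": (rb, 3), "A X S": (bc, 6), "I X A X S": (triple, 6),
}
for source, (sum_sq, df) in summary.items():
    print(f"{source:14s} {sum_sq:10.2f} {df:3d} {sum_sq / df:10.2f}")
print(f"{'Total':14s} {total:10.2f} {N - 1:3d}")
```

Working from the tables of sums rather than from the 24 raw scores mirrors the order of the hand computation, so each printed line can be compared with the corresponding step above.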

The several sums of squares, their df's, and the resulting variance
estimates are brought together in Table 59. On the basis
of the null hypothesis, all these estimates are of the same population
variance.


Table 59. Analysis of Variance for Judged Whiteness by 4 Observers
for 3 Degrees of Albedo and 2 Levels of Illumination

Source                               Sum of Squares    df    Variance Estimate
Illumination                               204.17       1           204.17
Albedo                                   8,039.08       2         4,019.54
Subjects (individual differences)        1,302.83       3           434.28
Interaction: I X A                          46.58       2            23.29
Interaction: I X S                          62.17       3            20.72
Interaction: A X S                         964.92       6           160.82
Interaction, triple: I X A X S             174.08       6            29.01
Total                                   10,793.83      23





First we use the triple interaction as a basis for testing the 
significance of the simple interactions. Of chief interest in this 
example is the possible interaction between albedo and illumina¬ 
tion, but since this interaction variance is less than that for triple 
interaction, we know at once without computing F that the inter¬ 
action is insignificant. The illumination by individual interaction 
is also insignificant. The interaction of albedo with individuals 
yields an F of 160.82/29.01 = 5.54, which, for $n_1 = 6$ and $n_2 = 6$,
falls between the values of 4.28 and 8.47 for the .05 and .01 levels
respectively. This F of 5.54 is high enough to suggest that the
form of the relationship between judged whiteness and albedo 
varies somewhat from person to person. 

Now we turn to a test of the main effects. A test of the signifi¬ 
cance of block differences is a test of individual differences and is 
accordingly of little interest. The F of 434.28/29.01 = 14.97 is 
significant beyond the .01 level of significance. For illumination 
we have F = 204.17/29.01 = 7.04, which falls a little beyond 
the 5.99 required for P = .05 and is therefore suggestive of a real 
difference due to illumination. 

For albedo we have an F of 4019.54/29.01, or 138.56, which is 
highly significant. Before accepting the conclusion that albedo 
has a significant effect on judged whiteness, we must consider the 
fact that the albedo by individual interaction approaches signifi¬ 
cance. This indicates that there may be another source of varia¬ 
tion in the data, and that the albedo by individual rather than the 
triple interaction variance should be used as our error or denom¬ 
inator term for F when we are testing albedo. Thus we have 
F = 4019.54/160.82 = 24.99, which, though not as large as the
F of 138.56, nearly reaches the .001 level. If the I X S interaction
had been significant, we would need to use its variance in testing 
the significance of illumination. 
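The F ratios discussed above follow directly from the variance estimates of Table 59. The Python sketch below simply repeats the divisions; the dictionary layout and names are assumptions for the example, and the critical values are the ones quoted in the text rather than values computed here.

```python
# Variance estimates copied from Table 59.
variance = {
    "I": 204.17, "A": 4019.54, "S": 434.28,
    "I X A": 23.29, "I X S": 20.72, "A X S": 160.82,
    "I X A X S": 29.01,
}

# Every term is first tested against the triple interaction.
error = variance["I X A X S"]
f_ratios = {
    source: value / error
    for source, value in variance.items()
    if source != "I X A X S"
}

# Albedo retested against the albedo by individual interaction.
f_albedo_vs_axs = variance["A"] / variance["A X S"]

# Critical values quoted in the text (.05 level, .01 level where given).
quoted_critical = {"A X S": (4.28, 8.47), "I": (5.99, None)}

for source, f in f_ratios.items():
    note = ""
    if source in quoted_critical:
        note = f"  (quoted .05 value: {quoted_critical[source][0]})"
    print(f"F for {source:6s} = {f:7.2f}{note}")
print(f"F for A against A X S = {f_albedo_vs_axs:.2f}")
```

Ratios below 1, as for I X A and I X S, need no table lookup, which is the point made in the text about the albedo by illumination interaction.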

Actually, the foregoing results are not to be regarded as conclusive.
The data which we have used to illustrate the computations
are only a part of more complete data which involved additional
degrees of albedo and other levels of illumination. Partly
because of space limitations and partly because it is easier to 
illustrate the computations when only a few rows, columns, and 
blocks are involved, we have ignored a part of the available data. 

It should be kept in mind that this illustration is an example 
of the use of the triple classification scheme as a method for making
allowance for the use of correlated observations in a problem
of double classification involving the influence of two variables 
on a third. The triple interaction term is the basis for an error 
variance which allows for the fact that the means being compared 
are not based on independent observations. In this special use 
of triple classification, in which the blocks correspond to indi¬ 
viduals, the objective is identical with that in the earlier analysis 
of pursuit rotor learning (Table 52). The two situations are 
similar in that there are m (or B) scores in each cell of the row by 
column setup; they are different in that the m scores in any one
cell for the pursuit learning problem are independent of the m 
scores in other cells, whereas the B scores in each of the albedo- 
illumination cells are correlated—each person contributes a score 
to each cell. Both schemes permit a check on the interaction 
effect of the two independent variables used to classify the observa¬ 
tions into row groups and column groups. The use of RC obser¬ 
vations on each of B cases (if feasible) will yield more precise 
information than obtainable by having scores for m individuals 
in each of the RC cells. This is analogous to the well-known 
principle that experimentation in which individuals serve as their 
own controls tends to be more precise than that in which an inde¬ 
pendent control group is set up. 

TRIPLE CLASSIFICATION WITH m CASES PER CUBICLE 

We have seen how the possible association of a dependent 
variable with three independent variables can be tested by a 
variance analysis made on a triple classification basis. If one 
wishes either to base his results on more than RBC observations 
or to test the significance of the triple interaction, it is necessary 
to have more than one score in each cubicle. This can be accom¬ 
plished either by assigning m individuals to each of the RBC 
combinations of conditions or by using just m individuals with 
each yielding an observation under all the RBC conditions or by 
using m sets of RBC cases with one individual of each set assigned 
to each of the RBC groups. Matching may not be feasible; neither 
may the securing of RBC observations on each of m individuals 
be feasible. At times, however, the problem under consideration 
may require an observation on each individual under all the conditions.
Whether m individuals are so used by preference or by
necessity, we will have m measurements in each of the RBC cubicles,
but in testing the significance of the differences between the
means of rows or of columns or of blocks we will be dealing with a 
situation in which the means are correlated because they are 
based upon the same individuals. To allow for this fact we would 
need a quadruple classification setup. 

Let us next consider the case in which we have in each cubicle 
m scores, which are independent of the m scores in other cubicles. 
The total number of scores will, of course, be mRBC, and the 
breakdown of the total sum of squares will include the components 
specified in Table 54 plus a within-cubicles sum of squares. Since 
each cubicle defines a group, the within-cubicles sum of squares 
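does not differ from previously discussed "within" sums of squares.
The formula in this case is

$$\frac{1}{m}\left[m\sum\sum\sum\sum X_{rbc}^{2} - \sum\sum\sum\Big(\sum X_{rbc}\Big)^{2}\right]$$

in which it is understood that the term $\sum\sum\sum\sum X_{rbc}^{2}$ contains mRBC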
squares and that the subtractive term indicates that we first sum 
the m scores separately for each cubicle, then square each of these 
sums, and finally sum all these RBC squared sums. The df for 
this term will be mRBC — RBC because we are dealing with the 
deviations of mRBC scores about RBC different means. 
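As a rough check on the within-cubicles formula, the following Python sketch evaluates it for a few made-up cubicles. The function name `within_cubicles_ss` and the scores are hypothetical; the data are not from any study cited here.

```python
def within_cubicles_ss(cubicles):
    """Within-cubicles sum of squares and its df.

    cubicles: a list with one entry per cubicle (RBC entries in all),
    each entry being the list of m scores in that cubicle.
    """
    m = len(cubicles[0])
    if any(len(scores) != m for scores in cubicles):
        raise ValueError("every cubicle must hold the same number of scores")
    # First term: the sum of all mRBC squared scores.
    sum_of_squares = sum(x * x for scores in cubicles for x in scores)
    # Subtractive term: sum the m scores in each cubicle, square, then add up.
    squared_sums = sum(sum(scores) ** 2 for scores in cubicles)
    ss = (m * sum_of_squares - squared_sums) / m
    df = m * len(cubicles) - len(cubicles)   # mRBC - RBC
    return ss, df


# Made-up example: 4 cubicles with m = 3 scores each.
example = [[1, 2, 3], [4, 6, 8], [5, 5, 5], [0, 2, 7]]
ss_within, df_within = within_cubicles_ss(example)
print(ss_within, df_within)
```

The same value results from summing, cubicle by cubicle, the squared deviations of the scores about their own cubicle mean, which is why the df is mRBC - RBC.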

With m independent scores per cubicle, the six computational
formulas (106) need only be modified by the use of 1/mRBC instead
of 1/RBC as the factor outside the brackets. It must be
understood, however, that the sums within the parentheses of
formulas (106) will involve m times as many scores as for the
simpler situation with one case per cubicle. The computation is
again accomplished by auxiliary tables, the main cell entries of
which will, of course, also involve sums with m times as many
scores. If we think of the orderly arrangement of the original
data, as exemplified in Table 53, it will be seen that each cell in
the separate block designations will consist of m score entries;
i.e., we will have m scores of the type $X_{111}$ or $X_{324}$. A more precise
notation would be to let $X_{irbc}$ stand for the score of the ith
person in the rth row and cth column of the bth block, with i
taking on values of 1, 2, ..., m.

Except for the use of 1/mRBC in place of 1/RBC in formulas
(106), the computation of the between and simple interaction
sums of squares follows exactly the steps outlined for a single
score per cubicle. The triple interaction sum of squares is again
obtained by subtraction, but now we must also deduct the within-cubicles
sum of squares. Note that in the formula of Table 54
which defines the triple interaction term we need to replace $X_{rbc}$
by $\bar{X}_{rbc}$, the mean of the m scores in the rth row and cth column
of block b.

ILLUSTRATIONS OF INTERACTION 

Reference to actual examples of statistically significant interac¬ 
tion may help clarify its meaning. For this purpose we shall use 
some data on visual acuity from an experiment by Walker.* For 
visual acuity (low score, better acuity) by two methods of measure¬ 
ment (depth and vernier) with binocular and monocular vision, 
we have means as given in Table 60.

Table 60. Visual Acuity: Interaction of Type of Measurement with Eyes

               Depth    Vernier    Total
Binocular        .08       1.07      .57
Monocular        .24       1.50      .87
Total            .16       1.28      .72

The marginal means are
markedly different, and it is readily seen that the cell means (each
based on 108 determinations) are not consistent with the marginal 
values. The ratio of 1 to 3 for the binocular versus monocular 
means of .08 and .24 varies from the 2 to 3 ratio for the means on 
the right-hand margin; and the ratio of near 1 to 13 for the values 
of .08 and 1.07 differs from the 1 to 8 ratio of .16 to 1.28. In other 
words, the amount of difference between binocular and monocular 
acuity depends upon the type of measurement. 
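One way to put a number on this is to compare the monocular minus binocular difference for the two types of measurement. The Python sketch below does so with the four cell means of Table 60; the "difference of differences" is offered as an illustrative summary in the additive sense used by the analysis of variance, not as a computation taken from Walker's study, and the names are assumptions.

```python
# Cell means from Table 60 (low score, better acuity).
means = {
    ("binocular", "depth"): 0.08,
    ("binocular", "vernier"): 1.07,
    ("monocular", "depth"): 0.24,
    ("monocular", "vernier"): 1.50,
}


def eye_difference(measure):
    """Monocular minus binocular mean for one type of measurement."""
    return means[("monocular", measure)] - means[("binocular", measure)]


diff_depth = eye_difference("depth")
diff_vernier = eye_difference("vernier")

# If the eye effect were the same for both measures, this would be zero.
interaction_contrast = diff_vernier - diff_depth

# The ratios discussed in the text.
ratio_depth = means[("binocular", "depth")] / means[("monocular", "depth")]
ratio_vernier = means[("binocular", "vernier")] / means[("monocular", "vernier")]

print(f"eye difference, depth:   {diff_depth:.2f}")
print(f"eye difference, vernier: {diff_vernier:.2f}")
print(f"difference of differences: {interaction_contrast:.2f}")
print(f"binocular/monocular ratio: depth {ratio_depth:.2f}, vernier {ratio_vernier:.2f}")
```

Whether such a contrast is statistically significant is a separate question, answered by the interaction F test rather than by inspection of the means.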

One variable investigated in the experiment was the distance 
of the stimulus from the subject. Since distance is an ordered 
variable, it is possible to picture the interaction by making a 
graph, with acuity as the ordinate and distance along the x axis. 
Figure 19 shows the relationship of acuity (average of the two 
types of measures) and the three distances used. Note the difference
between the two curves; the significant interaction for eyes and
distance actually means that the two curves are different. This lack
of parallel behavior of curves is more striking in Fig. 20, which
illustrates the interaction of measures with distance, for binocular
and monocular combined. In this study there was also a significant
variance for the subjects by distance interaction, from which one
concludes that the relationship between acuity and distance varies
from person to person (see Fig. 21).

* Walker, E. L., Factors in vernier acuity and distance discrimination,
Unpublished Doctor's Dissertation, Stanford University, California, 1947.

[Figures 19 to 22: graphs not reproduced; horizontal axis: Distance.]
Fig. 19. Simple interaction for eyes with distance.
Fig. 20. Simple interaction for measure with distance.
Fig. 21. Simple interaction for distance with subjects.
Fig. 22. Triple interaction for eyes by distance by measure.

To depict the meaning of a triple interaction involving ordered 
variables, a three-dimensional graph is required, but it is difficult 
to draw a graph which adequately illustrates that planes which 
represent the relationship between three variables are of different
form for each variation on a fourth variable. The data of Walker
yielded a significant triple interaction for eyes (binocular vs. 
monocular) and type of measurement and distance. Distance is 
the only ordered variable of the three independent variables in¬ 
volved in this triple interaction (the dependent variable, acuity, 
is of course an ordered variable). If we take acuity as the ordi¬ 
nate and distance as the abscissa and then plot the curves for 
vernier-monocular (VM), for vernier-binocular (VB), for depth- 
monocular (DM), and for depth-binocular (DB) acuity, we have 
Fig. 22. Perhaps a study of this figure along with the simple 
interactions (Table 60 and Figs. 19 and 20) will help one to under¬ 
stand the triple interaction for this situation. 

Any student who thumbs through the Journal of Experimental 
Psychology will find many examples in which the form of the 
relationship between two variables varies from condition to con¬ 
dition or varies with a third variable. Seldom will these have 
been labeled as interactions, mainly because the investigators 
have not been interested in significance tests. This is not to say 
that such tests should necessarily have been applied; there are
times when curves have been based on sufficient data to produce
such marked regularity of progressive change in form as to require
no statistical boost to one's confidence regarding the presence of
interaction. As mentioned at the beginning of this chapter, interactions
were present in the results of psychological research long
before the advent of the analysis of variance technique. The
advantage of having the variance technique is that one can fre¬ 
quently, on the basis of either preliminary observations or few 
data, determine that an interaction is significant. Additional 
experimentation will usually be necessary to specify adequately 
the varying forms of the relationships or of the functions in¬ 
volved. 


TESTS AND GENERALIZATIONS 

The many and diverse situations for which the variance tech¬ 
nique is applicable and the various possible tests of significance 
and the permissible generalizations therefrom cannot be ade¬ 
quately presented in an introductory discussion. Perhaps a hypo¬ 
thetical example will lead to a better understanding of some of 
the principles already touched on in this chapter and also illus¬ 
trate additional ones. 





Suppose that, in determining whether speed of reading is a 
function of style of type and of length of line, we use 4 differ¬ 
ent line lengths and 3 styles of type. To simplify matters, let us 
assume that we have an ample number of comparable 200-word 
passages for testing speed of reading, that we need not worry 
about fatigue or boredom or practice effects, and that compre¬ 
hension is somehow controlled. For such a study there are sev¬ 
eral experimental plans which are more or less feasible, some 
being obviously preferable to others. One could, of course, carry 
out independent experiments on the effect of line length and on 
the effect of style of type, i.e., 2 separate studies involving the 
single classification scheme. But, in order to take full advantage 
of the variance technique, we shall here consider possible plans 
for simultaneous investigation of the 2 variables. With 4 line 
lengths and 3 styles of type, we have 12 combinations of condi¬ 
tions. 

Plan A. One subject who is tested once under all 12 conditions.
The sum of squares involves 3 components: between line lengths,
between types, and a remainder. The variance estimate based
on the remainder term can be used to test the significance of the
primary or main effects. If either is significant, we can conclude
only that the results hold for our 1 individual; if we repeated the
experiment on this person, we would expect similar findings. No
generalization beyond 1 person is possible.

Plan B. One subject who is tested 5 times under each of the 12
conditions. In addition to the component sums of Plan A, we now
have a within-cells sum of squares, and the remainder term may
be referred to as the T X L (type by line length) interaction.
The within-cells variance, which is an error of measurement variance,
can be used to test this T X L interaction. If this interaction,
which has to do with the 12 cell means, is significant, we can
conclude that for our 1 person the effect of line length is different
for the various styles of type, and we need to use the interaction
variance in testing the main effects. However, if the interaction
is insignificant, the main effects can be tested by using the within-cells
variance, or we can combine the sums of squares, also the
df's, for cells and for interaction, and thereby secure a new variance
estimate for testing the primary effects. Note that, when
testing either the interaction or one of the main effects by means
of the within-cells variance, we are actually raising the question
of whether either the interaction or the primary variance is larger
than expected on the basis of the variations due to measurement
error. Plan B, like Plan A, does not permit any generalizations.

Plan C. Twelve subjects assigned at random to the 12 conditions 
and tested once. This setup, like Plan A, involves 2 between vari¬ 
ances and a remainder variance. If either line length or style of 
type is significant, as judged by F with the remainder variance as 
denominator, we can generalize to the population from which the 
subjects were drawn. No test of interaction is possible. 

Plan D. Twelve subjects assigned at random to the 12 conditions
and tested 5 times. The breakdown of the sum of squares is similar
to that for Plan B: between line lengths, between types, a T X L
component, and within cells. The last is again a function solely
of measurement error. First, let us consider the variance, $s^2_{tl}$,
based on the T X L sum of squares. Now $s^2_{tl}$, which it will be
recalled involves the variation of the 12 cell means after adjustments
for the marginal means, is a resultant of 3 sources of variation:
error of measurement, individual differences (each cell mean
is for an individual), and possible interaction between line length
and style of type. We have, in the within-cells variance, an estimate
of the error of measurement component, $s^2_e$. If $F = s^2_{tl}/s^2_e$
is significant, we have no way of knowing whether this is so because
of individual differences or because of real interaction. In
other words, Plan D does not permit a test of the interaction
because interaction is confounded with individual differences.

Which variance should we use for testing the primary effects?
We must use as the denominator of F a variance which includes
all sources of variation, other than the main effects, that might
contribute to the variation of the marginal means. We can be
reasonably sure that both measurement errors and individual
differences are sources of variation, and we know that interaction,
if real, would be an additional source. Now, if the interaction
were real, $s^2_{tl}$ would include all 3 sources, and it would therefore
be the proper variance for us to use in testing the main effects;
if the interaction were not real, $s^2_{tl}$, which includes the other
2 sources of variation, would still be appropriate for testing the
primaries. If the obtained value of $s^2_{tl}$ is not significant as judged
by $s^2_{tl}/s^2_e$, we can combine the 2 sums of squares and their df's
for the purpose of testing the main effects; rarely, if ever, will
$s^2_{tl}$ be insignificant since it involves individual differences.





Plan E. Ten subjects each of whom is tested once under all 12
conditions. The 120 scores for this plan require a triple classification:
by line length, by style of type, and by individuals. Consequently,
we will have 7 sums of squares: for line length (L), for
type (T), for individuals or subjects (S), and for interactions
T X L, T X S, and L X S, plus a remainder. First, we would
use the remainder variance to test the significance of the 3 simple
interaction variances. Of particular interest is the T X L variance;
if its F is significant, we can generalize to the population
represented by our 10 subjects. It is likely that the T X S and
the L X S variances will prove significant since both involve
individual differences. If so, we would use, as the error term for
F, $s^2_{ts}$ when testing style of type and $s^2_{ls}$ when testing line length,
and the results could be generalized.

Plan F. Ten subjects each of whom is tested 5 times under all 12
conditions. In addition to the 7 components of Plan E, we will
now have a within-cubicles variance, $s^2_e$ (measurement error). If
the triple interaction variance, T X L X S, called a remainder in
Plan E, is significantly larger than $s^2_e$, it becomes the error term
for testing the lower-order interactions; if the T X L X S variance
is not significant, its sum of squares may be combined with that
for within cells in order to have an error term with a larger df for
testing the main effects. But, since the within-cubicles variance
itself has a large df, 480, little will be gained by the combination.
Aside from the possibility of testing the triple interaction, Plan F
involves the same steps and the same possibilities for generalizations
as does Plan E.

Plan G. Five subjects assigned at random to each of the 12 conditions
and tested once. The breakdown of the total sum of squares
will involve components for the primary variables, for T X L
interaction, and for within cells (residual). We can use the 
within cells to test the interaction variance and generalize there¬ 
from to the population represented by the 60 subjects. Again, 
the variance to use as error for testing the main effects will depend 
upon the interaction F. Generalizations about the main effects 
are possible. 
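The degrees of freedom implied by these plans can be tabulated mechanically. The Python sketch below does so for Plans E, F, and G, assuming the usual rule that a term's df is the product of (levels - 1) over the factors it involves; the function `df_table` and its arguments are hypothetical names chosen for the illustration.

```python
from itertools import combinations
from math import prod


def df_table(levels, m=1):
    """Degrees of freedom for a complete crossed classification.

    levels: dict mapping factor name to its number of levels.
    m: number of scores in each cell (or cubicle).
    """
    names = list(levels)
    table = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            table[" X ".join(combo)] = prod(levels[f] - 1 for f in combo)
    n_cells = prod(levels.values())
    if m > 1:
        table["within"] = n_cells * (m - 1)
    table["total"] = n_cells * m - 1
    return table


# 3 styles of type (T), 4 line lengths (L), 10 subjects (S).
plan_e = df_table({"T": 3, "L": 4, "S": 10})        # one score per cubicle
plan_f = df_table({"T": 3, "L": 4, "S": 10}, m=5)   # 5 scores per cubicle
plan_g = df_table({"T": 3, "L": 4}, m=5)            # 5 subjects per cell

for name, plan in (("E", plan_e), ("F", plan_f), ("G", plan_g)):
    print(f"Plan {name}: {plan}")
```

In Plan E the T X L X S entry is the term the text calls the remainder, and the within entry for Plan F reproduces the df of 480 mentioned above.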

What of the relative merits of these several plans? In addition 
to contrasts already made, let us proceed to further examination 
of certain of the alternatives. Note that Plan D involves securing
as many scores as Plan G; hence the two would probably require
the same expenditure of the experimenter's time. Plan G provides 
no estimate of measurement error, but it does permit a test of 
possible T X L interaction. Since Plan G uses 5 times as many 
subjects, most investigators would place more confidence on 
results secured thereby. But it does not follow that Plan G, with 
60 subjects, is preferable to a plan like E, the general nature of 
which is not changed by using more than 10 subjects. For a given 
budget, greater precision from the viewpoint of the sampling of 
individuals can be attained by E than by any of the other plans. 
Unless it is known that the measurements are not very reliable, 
it is doubtful whether the securing of more than 1 measurement 
per subject per condition is worth while. For instance, rather 
than the 600 scores on 10 subjects, called for in Plan F, it would 
probably be more efficient to secure scores on 50 subjects, each 
tested just once under the 12 conditions. Precision in individual 
scores can be obtained by averaging several measurements, but 
this greater reliability will seldom compensate for failure to have 
a sizable number of subjects. So far as the author knows, no 
procedure is available for determining the optimum ratio of num¬ 
ber of subjects to the number of measurements per subject. 

This discussion of the several hypothetical plans illustrates a 
general principle which may with profit be repeated more explic¬ 
itly. If the triple interaction is significant, its variance becomes 
the proper error term for testing the significance of the simple 
interactions and also the main effects if the simple interactions 
are insignificant. If the triple interaction is not significant, one 
may use the residual variance in making these tests. If the triple 
interaction is insignificant (P greater than .05), it is permissible 
to combine its sum of squares and that for within cubicles with 
the sums for such simple interactions as are insignificant into a 
new sum of squares, the df of which is obtained by adding the
df's for the sums being combined. The variance estimate based
on this combined sum can then be used as the error term for
another check of such simple interactions as were of doubtful
significance when the residual variance estimate was used as the
error term. The advantage of using an estimate based on such a
combined sum of squares is that it will be based on a larger df
and therefore be more stable. If a simple interaction is significant,
its variance should be used in testing the significance of the
main effects for the two variables involved. If one of the variables
in a simple interaction is individuals, it is very likely to be
significant, and if so its variance becomes the error term for test¬ 
ing the significance of the main effects on the other variable. 
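The combining of sums of squares described in this paragraph is simple arithmetic, sketched below in Python. The sums of squares are invented for the illustration (only the df values echo Plan F), and `pooled_variance` is a hypothetical helper, not a procedure named in the text.

```python
def pooled_variance(terms):
    """Combine (sum of squares, df) pairs into one variance estimate.

    The sums of squares are added, the df's are added, and the
    combined sum is divided by the combined df.
    """
    pooled_ss = sum(ss for ss, _ in terms)
    pooled_df = sum(df for _, df in terms)
    return pooled_ss, pooled_df, pooled_ss / pooled_df


# Hypothetical values: an insignificant triple interaction, the
# within-cubicles term, and one insignificant simple interaction.
terms = [
    (108.0, 54),    # triple interaction (made-up sum of squares)
    (1008.0, 480),  # within cubicles (made-up sum of squares)
    (45.0, 18),     # an insignificant simple interaction (made up)
]
pooled_ss, pooled_df, pooled_var = pooled_variance(terms)
print(pooled_ss, pooled_df, round(pooled_var, 3))
```

Terms that proved significant would be left out of the list, since pooling is permitted only for those judged insignificant.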

HIGHER-ORDER CLASSIFICATION 

There are times when it is both desirable and feasible to study 
the variations of a dependent variable associated with variations 
in more than 3 variables. For such a study the data are classifi¬ 
able in more than 3 ways. We have already mentioned the setup 
in which an observation is made on each of m individuals under 
each of the combinations of conditions defined by rows, blocks, 
and columns. There will be RBC scores for each individual, and 
the scores may be classified not only as belonging to a given row 
and a specified column of a particular block but also as belonging 
to a certain individual. Although it is easy to make an orderly 
arrangement of the data for quadruple classification, the required
computations become somewhat burdensome. For the situation
involving a fourth classification, based on either individuals or
on a fourth independent variable, there will be 16 sums of squares:
1 for total, 4 for between groups, 6 for simple interactions, 4 for
triple interactions, and 1 for quadruple interaction. When 5
classifications are used we will have sums of squares for: the
total, 5 betweens, 10 simple interactions, 10 triple interactions,
5 quadruple interactions, and 1 fifth-order interaction. It is not
within the scope of this book to outline the computations for 
these higher-order classifications. 
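The counts given above are binomial coefficients: with k classifications there are as many interactions involving j of them as there are ways of choosing j from k. A short Python sketch (the function name is an assumption) reproduces the tallies:

```python
from math import comb


def count_sums_of_squares(k):
    """Number of sums of squares by order for k classifications.

    Order 1 counts the "between" terms, order 2 the simple
    interactions, order 3 the triple interactions, and so on.
    """
    by_order = {j: comb(k, j) for j in range(1, k + 1)}
    overall = 1 + sum(by_order.values())   # the 1 is the total sum of squares
    return by_order, overall


for k in (3, 4, 5):
    by_order, overall = count_sums_of_squares(k)
    print(k, by_order, overall)
```

The overall count is 2 to the power k, so each added classification doubles the number of sums of squares to be computed.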

The possibilities of the variance technique as a method of ex¬ 
tracting from one set of data information regarding not only 
primary effects but also interactions have, at times, led to rather 
indiscriminate inclusions of variables. For instance, a classification
of subjects as male or female may be made in order to determine
possible sex differences. Since the typical experiment for
which the variance technique is used is likely to be based on a 
relatively small number of subjects, it is very doubtful whether 
any information of value will be added to the sum total of the 
already inconsistent findings concerning sex differences. 

Those who carry out studies involving more than triple classification
encounter great difficulty in interpreting significant higher-order
interactions. Some have thought it safe, after ascertaining
the sums of squares for the primaries and the simple and triple
interactions, to use the remainder variance, which is a composite
of untested higher-order interactions, as an error term. Such a
practice assumes insignificance for the interactions whose sums
of squares are thus allowed to combine, but since there are instances
of significant quadruple interaction, the cautious investigator
will extract and test all the possible interactions before
using such a remainder as the error term for F.



CHAPTER 15 


Analysis of Variance: Covariance Method 


It is usually possible in experimentation to choose, either by 
random methods or by pairing or matching, groups that are com¬ 
parable on variables judged relevant to the comparisons to be 
made. There are times, however, when it is more practicable to 
use intact groups which may differ in important respects, and 
occasionally one may wish to make an unanticipated comparison 
which does not seem justifiable in light of known differences be¬ 
tween groups. Experimental control is the ideal, but, if this 
cannot be attained, one may resort to statistical allowances and 
thereby arrive at valid conclusions. 

Suppose that 2 intact groups are being used to evaluate the 
relative merits of 2 methods of memorizing and that the mean 
IQ is 105 for group A and 111 for group B. Now, if there is an 
appreciable correlation between the particular memorizing ability 
involved and intelligence, the results will need qualifying because 
of the difference in intelligence of the 2 groups. It would seem 
logical to use the regression equation, for estimating memory 
score from intelligence, as a basis for predicting how much of a 
difference in memorizing would arise because of the group difference
in IQ's. Let us suppose that the mean memory performance
is 60 for group A and 70 for group B, and that substituting
105 and 111 in the regression equation yields a predicted value of 
62 for group A and of 68 for group B. Thus our prediction would 
lead us to expect a difference of 6 points, and accordingly it would 
be said that 6 of the obtained difference of 10 could be attributed 
to lack of comparability of the 2 groups with respect to intelligence. 
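The arithmetic of this illustration can be laid out explicitly. The sketch below is in Python and uses only the figures given in the paragraph above; the variable names are illustrative.

```python
# Figures from the illustration in the text.
obs_a, obs_b = 60, 70      # observed mean memory scores, groups A and B
pred_a, pred_b = 62, 68    # means predicted from mean IQ by the regression equation

observed_diff = obs_b - obs_a              # difference actually obtained
expected_diff = pred_b - pred_a            # difference attributable to IQ
adjusted_diff = observed_diff - expected_diff

print(observed_diff, expected_diff, adjusted_diff)  # 10 6 4
```

The remaining 4 points are the adjusted difference whose sampling error is taken up next.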

The next question concerns the proper sampling error to use 
in evaluating the adjusted difference. It should be obvious that 
the ordinary procedure is inapplicable for the simple reason that
we have tampered with the obtained means and in so doing have
interfered somewhat with the operation of chance. 

It is the purpose of this chapter to give a precise method for 
making allowance for an uncontrolled variable and to set forth 
the sampling error adjustment which is needed in testing the 
statistical significance of the difference between "corrected"
means. The method is applicable whenever it seems desirable 
to correct a difference on an experimental variable for a known 
difference on another variable which for some reason could not 
be controlled by matching or by random sampling procedures. 
Since the scheme about to be proposed has an analysis of variance 
setting, the reader can readily guess that it will provide an adjust¬ 
ment for, and a test of significance of, the differences between 
two or more groups, and that it will be usable for either large or 
small samples. It is assumed that the experimental variable has 
a distribution which does not depart too far from the normal 
type and that the variances from group to group are similar. 

In order to present the required adjustments, we need
first to consider covariance, which is defined as $\sum xy / N$ or
$\sum (X - \bar{X})(Y - \bar{Y}) / N$. The sum of products of deviations can
be broken down into components in a manner similar to that
used with a sum of squares. In the simplest situation we can have
m pairs of X and Y scores in each of k groups. These pairs of
scores can be recorded in some such fashion as that depicted in
Table 61.

Table 61. Schema of Scores for Covariance

                                Group
       1             2         ...        j         ...        k
  X_11  Y_11    X_12  Y_12     ...   X_1j  Y_1j     ...   X_1k  Y_1k
  X_21  Y_21    X_22  Y_22     ...   X_2j  Y_2j     ...   X_2k  Y_2k
   ...
  X_i1  Y_i1    X_i2  Y_i2     ...   X_ij  Y_ij     ...   X_ik  Y_ik
   ...
  X_m1  Y_m1    X_m2  Y_m2     ...   X_mj  Y_mj     ...   X_mk  Y_mk

Note that $X_{ij}$ and $Y_{ij}$ stand for the X and Y values
of the ith individual in the jth group. Note also that in allowing
i to take on values running from 1 to m we do not imply any order
for the individual, and that the ith individual in one group is in
no sense paired with the ith case in another group. The product
of the deviation scores for the ith individual in the jth group
would be $(X_{ij} - \bar{X})(Y_{ij} - \bar{Y})$, in which $\bar{X}$ and $\bar{Y}$ are the
means for all km cases. The total sum of products would be
$\sum_i \sum_j (X_{ij} - \bar{X})(Y_{ij} - \bar{Y})$. Now each deviation can be expressed
in terms of two components in exactly the same way as in Chapter 13;
i.e., one part is the deviation of the score from the mean
of the group to which it belongs, and the other part is the deviation
of the group mean from the total mean. Thus we have

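$$(X_{ij} - \bar{X}) = (X_{ij} - \bar{X}_j) + (\bar{X}_j - \bar{X})$$

and

$$(Y_{ij} - \bar{Y}) = (Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - \bar{Y})$$

Then the above sum of the products becomes

$$\sum_i \sum_j [(X_{ij} - \bar{X}_j) + (\bar{X}_j - \bar{X})][(Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - \bar{Y})]$$

When the bracketed expressions are multiplied together, four
terms result, and, since two of these vanish, we have left that
the total sum of products is equal to

$$\sum_i \sum_j (X_{ij} - \bar{X}_j)(Y_{ij} - \bar{Y}_j) + m \sum_j (\bar{X}_j - \bar{X})(\bar{Y}_j - \bar{Y})$$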

The first of these terms involves a within-groups sum of products,
whereas the second is for between groups. If there happens to be
an unequal number of cases per group, the m of the second term
goes under the summation sign as $m_j$. The degrees of freedom
for the total sum of products is km - 1, or N - 1, where N is
the sum of the $m_j$'s; the df's for the within and between terms are
km - k (or N - k) and k - 1 respectively.
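The breakdown of the total sum of products into within and between parts can be checked numerically. The sketch below is in Python and uses a small set of hypothetical scores (not from the text) with unequal numbers per group, so that the number of cases $m_j$ enters inside the between-groups summation.

```python
# Hypothetical scores: k = 2 groups, each a list of (X, Y) pairs.
groups = [
    [(3, 5), (4, 7), (6, 8)],
    [(7, 9), (9, 12), (8, 10), (10, 13)],
]

def mean(values):
    return sum(values) / len(values)

all_pairs = [pair for g in groups for pair in g]
grand_x = mean([x for x, _ in all_pairs])
grand_y = mean([y for _, y in all_pairs])

# Total sum of products of deviations from the grand means.
total = sum((x - grand_x) * (y - grand_y) for x, y in all_pairs)

within = 0.0
between = 0.0
for g in groups:
    gx = mean([x for x, _ in g])
    gy = mean([y for _, y in g])
    within += sum((x - gx) * (y - gy) for x, y in g)
    between += len(g) * (gx - grand_x) * (gy - grand_y)  # m_j inside the sum

print(total, within + between)
```

The two printed values agree apart from floating-point rounding, which is the additive property stated above.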

It will be of convenience to assemble in a table the sums of 
products, along with the sums of squares, for both the X and Y 
variables. These will be found in the first three lines of Table 62. 

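Table 62. Setup for Analysis of Variance by Covariance Adjustments

                          Total               Within              Between
1. Sum of products        A_t                 A_w                 A_b
2. Sum of squares: X      B_t                 B_w                 B_b
3. Sum of squares: Y      C_t                 C_w                 C_b
4. df                     N - 1               N - k               k - 1
5. Correlation            A_t/sqrt(B_t C_t)   A_w/sqrt(B_w C_w)   A_b/sqrt(B_b C_b)
5a. df for r              N - 2               N - k - 1           k - 2
6. b_xy value             A_t/C_t             A_w/C_w             [A_b/C_b]
7. Adjusted sum of x^2    (B_t - A_t^2/C_t) minus (B_w - A_w^2/C_w) equals adjusted B_b
8. df                     N - 2               N - k - 1           k - 1

Although we are here presenting the covariance technique as a
method for making such adjustments as discussed in introducing
this chapter, it is of interest to link covariance with the problem
of correlation. The product moment correlation coefficient is
usually defined as

$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$

which may be written as

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

or as a function of a sum of products and two sums of squares.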
Using the sums of Table 62, we may specify three correlations: 
one based on the total sums, one based on the within sums, and 
one based on the between sums. These three correlations are 
indicated in line 5 by letters A, B, and C, with appropriate sub¬ 
scripts used to designate the several sums in the first three lines 
of the table. Line 5a gives the df's for the r's.

Note that the between-groups r is actually the correlation 
between the X means and the Y means for the groups. If this r 
is significant, it follows that one source of the correlation for the 
total group is the heterogeneity resulting from the throwing 
together of groups with unlike means. (This between-groups 
correlation is meaningless when only two groups are involved. 
Why?) Stated differently, an appreciable between-groups r indicates
that the total r is spurious; this spuriousness is eliminated
when r is computed from the within sums. The similarity of the 
within-groups r to the partial correlation coefficient will be recog¬ 
nized by the discerning student, especially if he recalls the deriva¬ 
tion of the latter. 

We now turn to the use of covariance as a basis for allowing 
for the influence of an uncontrolled variable on the differences 
between group means. The question here is not what the result 
would be if the uncontrolled variable were held constant, as in 
partial correlation, but rather what the result would be if the 
groups were made comparable with respect to the uncontrolled 
variable. Let X represent the experimental variable, and Y the
uncontrolled variable. It is presumed that the $\bar{Y}_j$ values differ,
and that X is correlated with Y in a linear fashion. For purposes
of exposition we shall refer to Table 62, which will serve as an
outline of the required computations. Line 6 of this table gives
the regression coefficients ($b_{xy}$) for predicting X from Y. Since
no use will be made of $A_b/C_b$, it is bracketed; it need not be
computed.

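That these A/C values are regression coefficients can readily
be demonstrated. In Chapter 7 the regression of X on Y was
given as

$$b_{xy} = r \frac{\sigma_x}{\sigma_y}$$

Since, as we have seen above,

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}, \quad \sigma_x = \sqrt{\frac{\sum x^2}{N}}, \quad \text{and} \quad \sigma_y = \sqrt{\frac{\sum y^2}{N}}$$

we have

$$b_{xy} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \cdot \frac{\sqrt{\sum x^2 / N}}{\sqrt{\sum y^2 / N}} = \frac{\sum xy}{\sum y^2} = \frac{A}{C}$$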

In order to make allowance for the uncontrolled differences in
$\bar{Y}_j$, we need not only to adjust the $\bar{X}_j$ values but also to make an
adjustment to the error term, which is used as the denominator
of the F ratio in testing the difference between the adjusted X
means. As in the simpler situation of Chapter 13, F will involve 
the ratio of a between-groups to a within-groups variance 
estimate. 

First, let us consider the method of making the adjustment to
the total and to the within-groups variance estimates. The
problem here is that of specifying how much of the variation in
X can be predicted from variation in Y and then of subtracting
this to secure the left-over variation as an adjusted value. But
this left-over variance is nothing more than the residual variance,
or square of the standard error of estimate, obtainable from
formula (35):

$$\sigma_{x \cdot y}^2 = \sigma_x^2 (1 - r^2)$$

Actually the adjustment is to be made to the sum of squares. In
order to state the residual variance in terms of sums, we may
substitute for $\sigma_x^2$ and $r^2$. Thus,

$$\sigma_{x \cdot y}^2 = \frac{\sum x^2}{N} \left[ 1 - \frac{(\sum xy)^2}{(\sum x^2)(\sum y^2)} \right]$$

hence,

$$N \sigma_{x \cdot y}^2 = \sum x^2 - \frac{(\sum xy)^2}{\sum y^2}$$

Since $N\sigma^2$ always equals a sum of squares, the value of $N\sigma_{x \cdot y}^2$ is
obviously the sum of squares for the residuals. In the notation
of this chapter,

$$\sum_i \sum_j (X_{ij} - \bar{X})^2 - \frac{\left[ \sum_i \sum_j (X_{ij} - \bar{X})(Y_{ij} - \bar{Y}) \right]^2}{\sum_i \sum_j (Y_{ij} - \bar{Y})^2}$$

would be the residual sum of squares after the regression adjustment.
This sum can be written as

$$B_t - \frac{A_t^2}{C_t}$$

which is the entry for the total group in line 7 of Table 62. Similarly,
the corresponding residual, or adjusted, sum of squares for
within groups is $B_w - A_w^2 / C_w$.
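That the quantity B - A^2/C is the sum of squared residuals about the regression line can be checked directly. The Python sketch below uses hypothetical paired scores (not from the text) and compares the residual sum of squares computed case by case with the value given by the formula.

```python
# Hypothetical paired scores (not from the text).
X = [2, 4, 5, 7, 9]
Y = [1, 3, 6, 6, 10]
N = len(X)

mx, my = sum(X) / N, sum(Y) / N
A = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # sum of products
B = sum((x - mx) ** 2 for x in X)                   # sum of squares, X
C = sum((y - my) ** 2 for y in Y)                   # sum of squares, Y

b = A / C  # regression coefficient for predicting X from Y

# Residuals of X about the regression line for predicting X from Y.
residuals = [(x - mx) - b * (y - my) for x, y in zip(X, Y)]
ss_residual_direct = sum(e ** 2 for e in residuals)
ss_residual_formula = B - A ** 2 / C

print(ss_residual_direct, ss_residual_formula)
```

The two values agree apart from floating-point rounding, so the adjusted sum of squares may be read as the variation in X left over after prediction from Y.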

At first thought it would seem logical to adjust $B_b$ by the use
of $A_b$ and $C_b$, but the between-groups correlation (and regression)
is affected by the differences between the X means, which are
the differences to be adjusted and then tested for statistical
significance. Our adjustment should be one which is independent of
the differences to be tested. This suggests that the regression
for within groups, or $A_w/C_w$, should be used since the regression
for the total is also affected by the difference which we are out to
test. In so far as we are concerned solely with the adjustment
of the between-groups X means, the best adjustment would be
by means of the within-groups regression. This could take the
form of either an adjustment to the between-groups sum of squares
for X or a direct adjustment to the several $\bar{X}_j$ values.

Although the latter would be the best way of ascertaining how
much of an effect the noncomparability of the groups with respect
to Y had upon the X means, there is another consideration as to
whether the within regression is appropriate for adjusting the
between-groups sum of squares. It will be recalled that F is to
be taken as the ratio of a variance estimate based on the between
sum of squares to that based on within groups, and that the two
variance estimates being so compared must be independent estimates.
Now, if we adjust both the within and the between sum
of squares by means of the same regression coefficient (say, that
based on within groups), any sampling error in this regression
coefficient would have a similar effect on both adjustments; hence
it could not be argued that the resulting adjusted sums of squares
possess the requisite independence. Therefore variance estimates
based thereon would not be strictly independent.

This difficulty is overcome by taking the adjusted sum of 
squares for between groups as the difference between the adjusted 
total sum and the adjusted within sum of squares. Thus, for the
purpose of testing significance,

$$\left( B_t - \frac{A_t^2}{C_t} \right) - \left( B_w - \frac{A_w^2}{C_w} \right)$$

leads to the proper adjustment for the between sum of squares
for X.

Perhaps the reader has anticipated that the df's may change as
a result of these manipulations. The new df's are recorded in
line 8 of Table 62. Note that the df for the between sum has not
changed since the adjustment was not made by using the between- 
groups regression. 

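Aside from the usual methods for calculating sums of squares,
we need formulas for computing sums of products in terms of
raw scores. The following formulas are written for unequal $m_j$
values, but are of course applicable for equal m's.

$$\sum_i \sum_j (X_{ij} - \bar{X})(Y_{ij} - \bar{Y}) = \sum_i \sum_j X_{ij} Y_{ij} - \frac{\left( \sum_i \sum_j X_{ij} \right) \left( \sum_i \sum_j Y_{ij} \right)}{N} \quad \text{for total} \tag{108a}$$

$$\sum_i \sum_j (X_{ij} - \bar{X}_j)(Y_{ij} - \bar{Y}_j) = \sum_i \sum_j X_{ij} Y_{ij} - \sum_j \frac{\left( \sum_i X_{ij} \right) \left( \sum_i Y_{ij} \right)}{m_j} \quad \text{for within} \tag{108b}$$

$$\sum_j m_j (\bar{X}_j - \bar{X})(\bar{Y}_j - \bar{Y}) = \sum_j \frac{\left( \sum_i X_{ij} \right) \left( \sum_i Y_{ij} \right)}{m_j} - \frac{\left( \sum_i \sum_j X_{ij} \right) \left( \sum_i \sum_j Y_{ij} \right)}{N} \quad \text{for between} \tag{108c}$$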

Thus to compute the sums of products of deviations, we need
the sum of all N raw score products or $\sum_i \sum_j X_{ij} Y_{ij}$, the sum of all
the X's or $\sum_i \sum_j X_{ij}$, the sum of all the Y's or $\sum_i \sum_j Y_{ij}$, the sum of the
X's separately for each group or $\sum_i X_{ij}$, and the sum of the Y's
for each separate group or $\sum_i Y_{ij}$. Adding the several X sums
gives the sum of all the X's; likewise for Y's. Note that to get
the second term of (108b), or the first term of (108c), we must
divide the product of the two sums for a group by its m and then
sum such quotients over all k groups. The reader may find some
interest in comparing formulas (108) with formulas (98), and it
should be apparent that in the case of equal m's formulas (108)
can be written in the simpler way of formulas (97).

Table 63. Score Data and Sums Based on Raw Scores for Analysis
of Variance by Covariance Adjustments

                                    Group
                           1            2            3
                         Y     X      Y     X      Y     X
                        14    10     11     5      7     5
                         9     6      9     2      6     4
                        11     8      8     6      2     1
                        12     6     10     5     10     7
                        10     9     10     4      7     9
                        11     7     10     8      7     4
                        11     9     12    10      6     5
                         8     5      9     6      3     2
                        11     6     10     4      2     2
                        12     7     11     6      9     5
Sum                    109    73    100    56     59    44
Mean                  10.9   7.3   10.0   5.6    5.9   4.4
Sum of Y^2 or X^2     1213   557   1012   358    417   246
Sum of XY                810          571          307

Sums over all groups:

$\sum\sum X = 173$
$\sum\sum Y = 268$
$\sum\sum X^2 = 1161$
$\sum\sum Y^2 = 2642$
$\sum\sum XY = 1688$
$\sum (\sum X)^2 = 10,401$
$\sum (\sum Y)^2 = 25,362$
$\bar{X} = 5.77$
$\bar{Y} = 8.93$

The required computations are illustrated by using the data
(fictitious) of Table 63, which contains Y and X scores for 10 cases
in each of 3 groups. The scores in each of the 6 columns are
separately summed to yield 109, 73, etc. The scores are squared
and summed to yield 1213, 557, etc. Summing the products of
the X and Y values gives 810, 571, and 307 for the 3 groups.
Summing over groups yields the double summations 173, 268,
etc. Certain of these sums are then substituted into formulas
(108) to secure the total, within, and between sums of products
of deviations. By substituting the proper sums into formulas
(97), we get the required sums of squares for the X's and for the
Y's. Then these 3 sets of sums are entered as the first 3 rows of
Table 64, which follows the pattern set forth in Table 62.


Table 64. Analysis of Variance for X Variable of Table 63 by
Covariance Adjustments for Uncontrolled Y

                            Total     Within    Between
1. Sum of products         142.53      72.70      69.83
2. Sum of squares: X       163.37     120.90      42.47
3. Sum of squares: Y       247.87     105.80     142.07
4. df                          29         27          2
5. Correlation               .709       .643       .912
5a. df for r                   28         26          1
6. b_xy value               .5750      .6871
7. Adjusted sum of x^2      81.42 minus 70.95 equals 10.47
8. df                          28         26          2
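The first three lines of Table 64 can be reproduced from the raw scores of Table 63. The following Python sketch applies the raw-score forms of formulas (108) as described in the text: a correction term based on the grand sums, and a term formed by dividing the product of the two sums for each group by its number of cases. The function name `partition` is illustrative only.

```python
# Scores of Table 63, one list per group.
Y = [
    [14, 9, 11, 12, 10, 11, 11, 8, 11, 12],
    [11, 9, 8, 10, 10, 10, 12, 9, 10, 11],
    [7, 6, 2, 10, 7, 7, 6, 3, 2, 9],
]
X = [
    [10, 6, 8, 6, 9, 7, 9, 5, 6, 7],
    [5, 2, 6, 5, 4, 8, 10, 6, 4, 6],
    [5, 4, 1, 7, 9, 4, 5, 2, 2, 5],
]

def partition(U, V):
    """Total, within, and between sums of products of deviations for U and V.

    Passing the same variable twice gives the corresponding sums of squares.
    """
    N = sum(len(u) for u in U)
    raw = sum(u * v for gu, gv in zip(U, V) for u, v in zip(gu, gv))
    correction = sum(map(sum, U)) * sum(map(sum, V)) / N
    by_group = sum(sum(gu) * sum(gv) / len(gu) for gu, gv in zip(U, V))
    return raw - correction, raw - by_group, by_group - correction

A = partition(X, Y)  # sums of products
B = partition(X, X)  # sums of squares, X
C = partition(Y, Y)  # sums of squares, Y

for label, row in (("products", A), ("squares X", B), ("squares Y", C)):
    print(label, [round(v, 2) for v in row])
```

Each row is printed in the order total, within, between, and in every row the within and between parts add to the total.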


Before proceeding to the covariance adjustment, let us consider
the means given in Table 63. It will be noticed that the
groups differ considerably on X, or the experimental variable,
and that they also differ on Y, the relevant but not controlled
variable. An analysis of variance based on the sum of squares
for the X's leads to a between-groups variance estimate of
42.47/2, or 21.26, and a within-groups estimate of 120.90/27, or
4.48. The F for testing the significance of the between-groups
variance becomes 21.26/4.48, or 4.75, which for the given df's
is significant at about the .02 or .03 level of significance. This
analysis does not, of course, allow for the fact that the groups
differ on Y. If there is correlation between X and Y, the observed
differences on X may be mainly a reflection of the group differences
on Y. As previously stated, the purpose of the covariance adjustment
is to make statistical allowance for such uncontrolled
differences.






By following the steps indicated in Table 62, we determine the
values in lines 5 to 7 of Table 64. Note that the adjusted $\sum x^2$
for between groups, 10.47, is secured by subtracting 70.95 from
81.42. The analysis of variance based on the adjusted sums of
squares (for the X's) gives a between-groups variance estimate
of 10.47/2, or 5.23, and a within-groups estimate of 70.95/26, or
2.73. Then F = 5.23/2.73 = 1.92, which for 2 and 26 degrees
of freedom yields a P of about .20. Accordingly, it cannot be
concluded that there are significant group differences on X over
and above those which would be expected because of the differences
on Y.
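The adjustment and the two F ratios can be retraced from the sums in the first three lines of Table 64. The Python sketch below follows lines 7 and 8 of Table 62; because it starts from sums already rounded to two decimals, its results may differ from the printed figures by a unit or so in the last decimal place.

```python
# Sums from the first three lines of Table 64.
A_t, A_w = 142.53, 72.70      # sums of products: total, within
B_t, B_w = 163.37, 120.90     # sums of squares for X: total, within
C_t, C_w = 247.87, 105.80     # sums of squares for Y: total, within
N, k = 30, 3

# Unadjusted analysis of variance for X.
B_b = B_t - B_w
F_unadjusted = (B_b / (k - 1)) / (B_w / (N - k))

# Line 7 of Table 62: residual sums of squares after regression on Y.
adj_total = B_t - A_t ** 2 / C_t
adj_within = B_w - A_w ** 2 / C_w
adj_between = adj_total - adj_within

# Line 8: one df is lost from the total and within terms, none from between.
F_adjusted = (adj_between / (k - 1)) / (adj_within / (N - k - 1))

print(round(adj_total, 2), round(adj_within, 2), round(adj_between, 2))
print(round(F_unadjusted, 2), round(F_adjusted, 2))
```

The adjusted F falls well below the unadjusted F, which is the comparison on which the conclusion in the text rests.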

It should be obvious that the use of the covariance adjustment 
method must be justified by logical and experimental considerations.
When it is logical to control a variable by pairing or matching,
then the covariance adjustment is defensible as a way of
making proper allowance for a failure, because of infeasibility, to
control the variable. The use of the covariance adjustment is 
not predicated on the degree of correlation between the experi¬ 
mental and the uncontrolled variable. If the correlation is rela¬ 
tively low, the adjusted values will differ but little from the 
unadjusted values; if high, both the total and within adjusted 
variances will differ considerably from the unadjusted variances, 
but, as we shall presently see, the extent to which the adjusted 
and unadjusted between-groups variances differ is not solely a 
function of the correlation. 

It is of interest to make an actual adjustment of the X means
of Table 63 for the group differences on Y. The adjustments can
be made by

$$\bar{X}_{ja} = \bar{X}_j - b_{xy} (\bar{Y}_j - \bar{Y})$$

in which $\bar{X}_{ja}$ is the adjusted value for the jth group, and $b_{xy}$ is
the within-groups regression coefficient. For the data of Table 63
we have

$$\bar{X}_{1a} = 7.30 - .687(10.90 - 8.93) = 5.95$$

$$\bar{X}_{2a} = 5.60 - .687(10.00 - 8.93) = 4.86$$

$$\bar{X}_{3a} = 4.40 - .687(5.90 - 8.93) = 6.48$$

Should the reader be surprised that the adjustment puts group 3
ahead, he should ponder the fact that, relative to the within-groups
X and Y variances, the third group's $\bar{X}$ of 4.40 was not as
far below the means of the other two groups as was its $\bar{Y}$ of 5.90.
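The adjusted means can be recomputed in a few lines. This Python sketch takes the group means of Table 63 and the within-groups regression coefficient of Table 64 as given; the variable names are illustrative.

```python
# Group means from Table 63 and the within-groups regression
# coefficient from line 6 of Table 64.
x_means = [7.30, 5.60, 4.40]
y_means = [10.90, 10.00, 5.90]
y_grand = 8.93
b_within = 0.687

# Each X mean is moved by the amount predicted from its group's
# departure from the grand mean on Y.
adjusted = [x - b_within * (y - y_grand) for x, y in zip(x_means, y_means)]
print([round(a, 2) for a in adjusted])  # [5.95, 4.86, 6.48]
```

Groups 1 and 2, which stand above the grand mean on Y, are adjusted downward, while group 3 is adjusted upward.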

From a careful consideration of the foregoing, it will be seen
that the covariance adjustment method will not necessarily reduce
the differences between the means on the experimental variable.
Situations arise in which groups that show marked differences on
some correlated but uncontrolled variable may yield similar
means on the variable being studied. Suppose that we are using
two intact groups to investigate the relative merits of two learning
methods, and that the initial means of the two groups are
markedly different. We would, accordingly, expect a difference
on final standing even though the two methods were equally
efficacious. If this expected difference is not found, it follows
that the method used by the group with the lower initial score
was more effective in that this group overtook the other group.
With groups differing on an uncontrolled variable, it is not only
as proper, but also as necessary, to use the covariance technique
when the groups are nearly the same on the experimental variable
as when they are different. For such situations the adjustment
will increase the between-groups variance. The adjusted variances
are sometimes referred to as "reduced" variances, but it follows
from the above that this term may be a misnomer for the adjusted
between-groups variance.

The extent to which the adjusted variances lead to a level of 
significance different from that based on an analysis of the un¬ 
adjusted values will obviously depend upon three things: the 
degree of correlation between the experimental and uncontrolled 
variable, the size of the differences between the groups on the 
uncontrolled variable, and the found differences on the experi¬ 
mental variable. The applicability of the covariance technique 
does not depend upon a minimum degree of correlation or upon a 
definite amount of group differences on the uncontrolled variable. 
But, if the within-groups correlation is low and/or there is only a 
small, chance difference between the groups on the uncontrolled 
variable, the use of the covariance adjustment may not be worth 
the effort. Obviously, if a variable correlates near zero with the 
experimental variable, it need not be controlled experimentally 
or statistically. 

The covariance technique can be extended in two directions.
If the groups differ on two or more relevant variables, an adjustment
for the variances on the experimental variable can be made
by multiple regression. When we have a double classification
scheme with RC groups of m cases, each of which may be assigned
to R rows and C columns, the technique can be extended to provide
adjusted variance estimates for between rows, for between
columns, for the R X C interaction, and for within groups. Since
these extensions are complicated and of rare utility, they will not
be treated in this introductory discussion.



CHAPTER 16 

Notes on Sampling and Statistical Inference 


The problems of statistical inference involve mathematical 
formulas for estimating the sampling errors needed in various 
practical situations. Many of the formulas and the situations in 
which they are applicable have been discussed in Chapters 5, 
11, and 12 and elsewhere, and the ramifications of the analysis
of variance technique have been presented in Chapters 13 to 15.
We shall now consider some additional aspects of the problem of 
sampling and statistical inference.* 

SAMPLING TECHNIQUES 

In considering the specific methods of drawing a sample in such 
a way as to avoid bias, we must differentiate between two types 
of situations: (1) All the units or individual members of a given 
universe are already catalogued or on file with more or less infor¬ 
mation of some kind already known concerning the universe; or 
(2) no file is available, and little is known about the universe 
except what has been inferred from previous samples. The first 
is exemplified by the universe of telephone subscribers or of auto¬ 
mobile owners or by the school population of a city, and the 
second by the typical universe dealt with in field surveys and 
investigations, such as the public opinion polls. 

Sampling methods, as used, may be classified under five head¬ 
ings: accidental, random, purposive, stratified or quota, and area 
or block. These will be discussed in the above order with more 
attention given the fourth method. 

*See McNemar, Q., Sampling in psychological research, Psychol. Bull.,
1940, 37, 331-365. Parts of this paper have been reproduced here, by
permission of the editor of the Bulletin. This article also contains a
bibliography on sampling.




Accidental sampling. Despite the fact that psychologists seem 
to use the method of accidental sampling more than any other, 
it has nothing to recommend it either on statistical or on scientific 
grounds. Its very ease and simplicity have, no doubt, led to its 
wide use. This method is essentially nothing more than its name 
implies: the accidental choice of individuals for the sample. Any 
individual who is available and can be corralled into service be¬ 
comes a subject. The method has its corollary in the haphazard 
and accidental manner in which many universes are chosen for 
study. In fact, the available subjects may not have been chosen 
as representing any defined universe but used to define a
posteriori the universe being sampled. It is thus that the college
sophomore becomes the raw stuff out of which psychologists build
a science of human behavior. Aside from the failure of the
characteristics of sophomores to be typical of the generality of
mankind, one must also remember that the lowly sophomore is of a
decidedly different species from institution to institution. Even 
though we grant that the college sophomore is typical of man¬ 
kind, certain accidental factors affect the likelihood of any one 
individual's inclusion in a sample of sophomores. His cooperation 
must be secured, and, what may be more important in personality 
studies, his chance of representing Homo sapiens is increased if 
his interest in himself and his own personality adjustment has 
led him to take elementary psychology. 

Random sampling. By the method of random sampling it is 
fairly easy to arrive at a representative sample, provided the uni¬ 
verse has already been catalogued. Thus, if one wishes a repre¬ 
sentative sample of school children of a certain grade in a city, 
he can secure it by a purely mechanical scheme, such as taking 
every nth card from the files. Although this type of systematic 
sampling does not exactly satisfy the conditions of random sam¬ 
pling, it will assure a random sample unless the cards have been 
systematically arranged in other than alphabetical order. 
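The mechanical scheme of taking every nth card can be sketched as follows. This is an illustration in Python with a hypothetical card file; the random choice of starting position among the first n cards is an assumption added here, not a step stated in the text.

```python
import random

def systematic_sample(cards, n, rng=random):
    """Take every nth card, starting at a randomly chosen position among the first n."""
    start = rng.randrange(n)
    return cards[start::n]

# Hypothetical card file of 1,000 pupils, identified by number only.
card_file = list(range(1, 1001))
sample = systematic_sample(card_file, 20, random.Random(0))
print(len(sample), sample[:5])
```

With 1,000 cards and n = 20 the sample always contains 50 cards, spaced 20 apart; whether it behaves as a random sample depends, as the text notes, on how the file happens to be arranged.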

A psychologist will find little consolation in the thought that 
there are mechanical schemes for drawing a random sample, since 
files seldom exist for the universes with which he deals. The use 
of the random method for sampling an uncatalogued population 
involves so many difficulties in psychological research that no
specific schemes are to be found in the literature. The Literary
Digest straw polls rested on the assumption that the population
of telephone and car owners was not different in its voting preference
from the entire population of potential voters. This happened
to hold before 1936, so that replies to ballots mailed at
random to telephone and car owners forecasted fairly accurately 
the election results. The failure in 1936 is attributed to an align¬ 
ment of voting to income levels. 

Purposive method. This method depends upon the selection 
of groups which, together, yield the same averages or proportions 
as does the whole universe with respect to those quantities or
qualities which are already a matter of knowledge. If the variables 
under study are related to the known factors, the samples (groups 
taken together) will be typical of the whole. It should be noted 
that all the individuals in the several groups are used, that the 
sampling unit is the group, that the efficacy of the method depends 
upon the degree of relationship between the criterion variables 
and the characteristic being studied, and, therefore, that its use 
is contingent upon considerable foreknowledge. Since the method 
has not found much favor and since it is not particularly adaptable 
for psychological sampling, we will give it no further consideration. 

Stratified sampling. In the stratified method, sometimes
called the quota method, one or more individuals are pulled at
random from each of several strata, the number in the sample
from each stratum being proportional to the universe number in
the stratum, and the strata are predetermined by knowledge of
some control variable or variables. Psychologists who sample so
as to secure proportionate representation from the several occupational
levels are, in reality, using the stratified method. It should
be obvious that the method can be used for either catalogued or
uncatalogued universes, provided information is available on
some variable or variables which permits their use in setting up
the strata. Common-sense reasoning and mathematical treatment
agree in showing that this method gives more reliable results
than the purely random method, provided the experimental
variable is related to the stratifying variables. By way of contrast, if
we had information on some universe with regard to the heights
of the individuals, nothing would be gained by using height as a
means of setting up strata for the purpose of drawing a sample
from which to infer the IQ's of the group. Such a procedure
would not lead to better (or worse) results than would be obtained
by the random method.
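Proportional allocation, as described above, can be sketched in Python. The universe, the stratum labels, and the function name below are hypothetical; the sketch assumes the members of each stratum are already listed, which is the catalogued case.

```python
import random

def stratified_sample(strata, sample_size, rng):
    """Draw at random from each stratum in proportion to its universe size.

    strata maps a stratum label to the list of its members. Rounding the
    quotas means the realized sample size can differ slightly from sample_size.
    """
    universe = sum(len(members) for members in strata.values())
    sample = {}
    for label, members in strata.items():
        quota = round(sample_size * len(members) / universe)
        sample[label] = rng.sample(members, quota)
    return sample

# Hypothetical universe of 1,000 persons in three occupational levels.
strata = {
    "professional": [f"P{i}" for i in range(100)],
    "skilled": [f"S{i}" for i in range(300)],
    "unskilled": [f"U{i}" for i in range(600)],
}
drawn = stratified_sample(strata, 50, random.Random(1))
print({label: len(members) for label, members in drawn.items()})
```

A sample of 50 is thus divided 5, 15, and 30 among the three levels, matching their shares of the universe.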




It might be anticipated that the error formulas for stratified 
sampling will differ from the ordinary formulas since the condi¬ 
tions of sampling are essentially different. A consideration of the 
formulas will indicate (1) that they are different, (2) that greater 
precision results from stratified sampling, and (3) that there are 
limiting factors as to the possible increase in precision. The 
following formulas hold for large samples. 

When one is sampling for attributes by the stratified method,
the standard error of an obtained proportion, P, is given, in terms
of information yielded by the sample, approximately by

$$\sigma_P = \sqrt{\frac{PQ - \sigma_p^2}{N}} \tag{109}$$

where P equals the proportion in the total sample, N, who possess
the attribute, Q = 1 - P, and $\sigma_p^2$ is the weighted variance of
the several strata proportions about the sample value P, or

$$\sigma_p^2 = \frac{n_1 (p_1 - P)^2 + n_2 (p_2 - P)^2 + \cdots + n_k (p_k - P)^2}{N}$$

where $n_1$, $n_2$, etc., are the numbers of cases, and $p_1$, $p_2$, etc., the
proportions, in the several strata, there being k strata in all. A
casual examination of formula (109) shows that the magnitude
of the error is less for a stratified sample than for ordinary sampling,
and that the increase in precision depends upon one's ability
to stratify the universe in such a way as to secure strata which
are really different with regard to the attribute being studied.
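A numerical comparison may make the gain in precision concrete. The Python sketch below uses hypothetical strata and takes formula (109) to have the form $\sqrt{(PQ - \sigma_p^2)/N}$, with $\sigma_p^2$ the weighted variance of the strata proportions as defined in the text; that reading of the formula is an assumption of this sketch.

```python
from math import sqrt

# Hypothetical strata: (number of cases n_i, proportion p_i with the attribute).
strata = [(200, 0.20), (500, 0.50), (300, 0.80)]

N = sum(n for n, _ in strata)
P = sum(n * p for n, p in strata) / N   # proportion in the total sample
Q = 1 - P

# Weighted variance of the strata proportions about P.
var_p = sum(n * (p - P) ** 2 for n, p in strata) / N

se_random = sqrt(P * Q / N)                 # ordinary formula
se_stratified = sqrt((P * Q - var_p) / N)   # assumed form of formula (109)
print(round(se_random, 4), round(se_stratified, 4))
```

The stratified standard error is the smaller of the two, and it would equal the ordinary one if the strata proportions were all alike.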

The formula for the standard error of the mean when the sample
has been secured by the stratified method has been variously
stated. The variance (standard error squared) of the mean is
given by

$$\sigma_{\bar{X}}^2 = \frac{1}{N} (\sigma^2 - \sigma_m^2) \tag{110}$$

where $\bar{X}$ = the sample mean, $\sigma^2$ = sample variance, and $\sigma_m^2$ =
the variance of the means of the several strata about the total
sample mean. An exactly equivalent form is

$$\sigma_{\bar{X}}^2 = \frac{1}{N} \left\{ \sigma^2 - \frac{n_1 (\bar{X}_1 - \bar{X})^2 + n_2 (\bar{X}_2 - \bar{X})^2 + \cdots + n_k (\bar{X}_k - \bar{X})^2}{N} \right\} \tag{110a}$$

where $n_1$, $n_2$, ... are the numbers, and $\bar{X}_1$, $\bar{X}_2$, ... the means in
the separate strata. Expression (110a) states explicitly that the
$\sigma_m^2$ term of (110) involves weighting each stratum mean by the
sample number of cases in the stratum.

If stratification has been accomplished by the use of a variable, 
U, which is linearly related to the variable being studied, the 
formula can be written in the form 

$$\sigma_{\bar X}^2 = \frac{\sigma_x^2}{N}\left(1 - r_{xu}^2\right) \qquad (111)$$

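If one prefers, he can compute the standard deviations separately 
for the several strata distributions of the variable being studied 
and use these to arrive at the standard error of the mean by 
substituting in 

$$\sigma_{\bar X}^2 = \frac{\sum n_i \sigma_i^2}{N^2} \qquad (112)$$

or its equivalent 

$$\sigma_{\bar X}^2 = \frac{1}{N^2}\left(n_1\sigma_1^2 + n_2\sigma_2^2 + \cdots + n_k\sigma_k^2\right) \qquad (112a)$$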

It matters little which of these formulas is used in practice, 
except that (111) is not as general as the others, but it can be 
made so by substituting the proper correlation ratio, η, for r. 
Perhaps form (110) is the most practicable. Regardless of the form, 
it will be noticed that stratified sampling does lead to greater 
precision in the sense of a smaller chance error, but only when the 
control or stratifying variable is related to the variable being 
studied. This is explicit in (111) and directly implied in formulas 
(110) and (110a); i.e., the means differ from stratum to stratum, 
and form (112) indicates that the increase in precision, if any, is 
due to a greater homogeneity, for the variable being studied, 
within the several strata than for the total sample. These are but 
three slightly different ways of regarding the same thing. 
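
That the several forms are ways of regarding the same thing can be checked on any set of scores. The sketch below assumes variances computed with divisor N (as befits large-sample formulas) and assumes that (110), (110a), and (112a) have the forms implied by the surrounding definitions; the function name and the scores are hypothetical.

```python
def variance_of_mean_three_ways(strata):
    """strata: list of lists of scores, one inner list per stratum."""
    all_scores = [x for s in strata for x in s]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total
    # Total sample variance (divisor N).
    var_total = sum((x - grand_mean) ** 2 for x in all_scores) / n_total
    sizes = [len(s) for s in strata]
    means = [sum(s) / len(s) for s in strata]
    # Sum of n_i times squared deviation of each stratum mean from the total mean.
    between_sum = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    var_means = between_sum / n_total
    # Within-stratum variances (divisor n_i).
    var_within = [sum((x - m) ** 2 for x in s) / len(s)
                  for s, m in zip(strata, means)]
    form_110 = (var_total - var_means) / n_total
    form_110a = var_total / n_total - between_sum / n_total ** 2
    form_112a = sum(n * v for n, v in zip(sizes, var_within)) / n_total ** 2
    return form_110, form_110a, form_112a, var_total / n_total

# Hypothetical scores in three strata.
strata = [[90, 95, 100, 105], [100, 110, 120], [115, 120, 125, 130, 135]]
f110, f110a, f112a, unstratified = variance_of_mean_three_ways(strata)
print(f110, f110a, f112a, unstratified)
```

The first three values agree to rounding error, and all are smaller than the unstratified value whenever the stratum means differ.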

One can also deduce from the above formulas that stratification 
on the basis of a variable, U, for studying variable X may not 
lead to an improved sample for studying some other variable, Y, 
unless U and Y are correlated. If several variables are used in 
stratifying, the correct standard error formula involves substituting 
in formula (111), in the place of $r_{xu}$, the multiple correlation 
between X and the control variables. Since the multiple correlation 
coefficient increases slowly as more variables are added, it 
follows that the gain in precision which results from using more 
than two or three control variables may be very small. 

The applicability of the stratified method depends, of course, 
upon a priori knowledge of the universe with regard to possible 
control characteristics, and its advantage is contingent upon the 
additional condition that the variable being investigated is related 
to the possible control variables. Often information is lacking 
on this latter point, so that the investigator must rely on judgment 
as to what variable or variables will make profitable controls. 
At the present time the characteristics which can be utilized 
as controls in stratified sampling in psychology are few in number: 
socioeconomic or occupational status, urban-rural residence, 
geographical factors, age, sex, racial or national origin, education, 
and perhaps intelligence. Despite the limitation of the stratified 
method of sampling, its use offers psychologists one of the best 
available schemes for drawing a representative sample. In addition 
to yielding a possibly greater precision, the method should, 
perhaps, tend to the elimination of bias. 

Area sampling. Recent developments in the Bureau of the 
Census indicate that an excellent way for securing an unbiased 
sample is by the area or block method. This method depends 
largely upon known characteristics of the population to be sampled, 
particularly as these are related to large geographical regions 
and small subdivisions thereof. Since the method is so expensive 
as to preclude its use except by governmental or the largest private 
polling agencies, we will not give details here.† 

† For further discussion see Hansen, M. H., and Hauser, P. M., Area 
sampling—some principles of sample design, Publ. Opin. Quart., 1945, 9, 183-193. 

SAMPLING AND THE USE OF EXPERIMENTAL AND 
CONTROL GROUPS 

We have already discussed (Chapter 5) some of the advantages 
of using paired individuals for experimental and control groups. 
It was pointed out that the r term in the square of the standard 
error of the difference formula 

$$\sigma_D^2 = \sigma_{\bar X_1}^2 + \sigma_{\bar X_2}^2 - 2r_{12}\sigma_{\bar X_1}\sigma_{\bar X_2} \qquad (26)$$

is a necessary and sufficient allowance for the fact that the two 
means being compared are not based on independent samples. 
If more than two nonindependent groups are being compared, a 
special case of the double classification scheme of the analysis of 
variance permits an over-all test of the significance of the difference 
between the several means, with due allowance for their 
lack of independence (see pp. 274-276). 

In the planning of research wherein an experimental and a 
control group are essential, it is well to consider rather carefully 
the benefits to be derived from control of variables likely to be 
sources of error, by way of pairing or matching with respect to 
these variables vs. depending upon randomization as a method of 
controlling them. The pairing of individuals as a method of 
equating an experimental and control group with respect to relevant 
variables (those which might be related to the experimental 
variable) has long been recognized as sound experimental method. 
Any found difference between the two samples cannot be explained 
as due to a difference between the groups in regard to the variables 
so controlled. The ideal experimental situation would be attained 
if all the variables likely to affect the difference between the groups 
on the experimental variable were controlled in the sense of being 
equated for the two groups. But so little is known about the 
interdependence of psychological variables that this ideal can 
never be achieved. It follows, therefore, that, regardless of how 
carefully we equate groups on the basis of certain variables, there 
will be other variables of more or less importance upon which the 
groups might differ, and the hope is that by the principle of 
randomization no greater than chance differences between the 
experimental and control groups will exist for these unknown variables. 

Consequently, one may very well ask whether there is an advantage 
in equating by pairing. The answer is yes, provided one's 
knowledge and intuition are such that out of the variables available 
for pairing one can select those which are really pertinent. 
If one is fortunate enough in pairing to create an interpair correlation 
as high as .75 on the experimental variable, the standard 
error of the difference between means will be reduced by one-half. 
To accomplish this increase in precision by using larger groups 
would involve quadrupling the original numbers. An r of .50 will 
increase the precision as much as will doubling N. On the other 
hand, if equating does not lead to pair correlation on the experimental 
variable, one has evidence that the pairing scheme, 
regardless of how elaborate, yielded neither a statistical nor an 
experimental advantage. There remains only the psychological 
satisfaction of knowing that certain variables were controlled. 
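
The figures just quoted follow from formula (26) when the two means have equal standard errors, for then the variance of the difference is proportional to (1 - r). The sketch below rests on that equal-variance, equal-N assumption; the function names are hypothetical.

```python
from math import sqrt

def se_ratio_paired(r):
    """Paired SE of the difference divided by the unpaired SE (equal variances, equal N)."""
    return sqrt(1.0 - r)

def equivalent_n_multiplier(r):
    """Factor by which N must be multiplied, without pairing, to match the paired precision."""
    return 1.0 / (1.0 - r)

for r in (0.0, 0.50, 0.75):
    print(r, se_ratio_paired(r), equivalent_n_multiplier(r))
```

With r = .75 the ratio is one-half and the multiplier is 4; with r = .50 the multiplier is 2; with r = 0 nothing is gained, as the paragraph states.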

If the experimenter has no hunch as to what variables should 
form the basis for pairing, he must depend solely upon the principle 
of randomization, which, in the typical situation, consists of 
dividing a given group randomly into halves and taking one-half 
for the experimental, the other for the control, group. If the 
available group of individuals can be catalogued in some fashion, 
it can always be split into halves by some mechanical scheme, 
thus assuring randomness with regard to all the known or unknown 
characteristics of the individuals. If the experimental cost per 
individual is such that fairly large numbers can be utilized, this 
scheme of randomly splitting a group into two subgroups as 
experimentals and controls has much to recommend it. The 
experimentalist may object to this by saying that he prefers not 
to trust chance or luck to yield groups which are comparable on 
what he thinks are pertinent variables. In this connection it is 
important to remember that randomization by mechanical schemes 
will never lead to more than a chance difference between the 
groups on relevant variables, and, since the difference for any 
variable is purely chance, one cannot expect the difference to have 
more than a chance effect on the result for the experimental 
variable. If, for example, a chance, i.e., nonsignificant statistically, 
difference exists in the initial mean reading scores of two 
groups, this difference, in and of itself, will not lead to a significant 
difference in the relative extent to which they will profit from two 
diverse methods of teaching improvement in reading. The sampling 
error formula is adequate for evaluating such chance phenomena. 
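
One mechanical scheme of the kind referred to is to shuffle the catalogued individuals and cut the list in two. The following sketch uses a pseudo-random shuffle as a modern stand-in for such a scheme; the function name, the catalogue numbers, and the seed are hypothetical.

```python
import random

def split_into_halves(ids, seed=None):
    """Randomly divide catalogued individuals into experimental and control halves."""
    rng = random.Random(seed)   # a fixed seed makes the split reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)       # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical catalogue of 40 individuals, numbered 1 to 40.
experimental, control = split_into_halves(range(1, 41), seed=1949)
print(sorted(experimental))
print(sorted(control))
```

Because the assignment ignores every characteristic of the individuals, any difference between the halves on any variable, known or unknown, is a chance difference.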

It should be noted here that an original group which is split 
into halves either at random or by pairing must be regarded as 
representative of some defined universe, and that such conclusions 
as are drawn from the experiment cannot be generalized unless it 
can be shown that the defined universe is representative of the 
generality of mankind with respect to the variable being studied. 
In other words, those who persist in using the college sophomore 
as a laboratory representative of mankind have not avoided, by 
showing that selective factors did not render their experimental 
and control groups noncomparable, the necessity of bridging the 
gap between the sophomore's behavior and that of the typical 
human being. 

It is interesting to consider the use of experimental and control 
groups in the light of a formula for the standard error of the 
difference between means derived by Wilks.‡ Let X be the variable 
under study and Y a possible variable for control; then, if the 
individuals of each group have been so selected as to yield identical 
distributions on the matching or control variable, Y, the sampling 
variance of the difference between the means on the X variable 
will be given by 

$$\sigma_D^2 = \left(\sigma_{\bar X_1}^2 + \sigma_{\bar X_2}^2\right)\left(1 - r_{xy}^2\right) \qquad (113)$$

where $r_{xy}$ is the correlation between the experimental and control 
variables. If the matching has been made on the basis of several 
control variables, the given correlation becomes the multiple 
correlation between the experimental variable and the matching 
variables. There are two important aspects of this which deserve 
mention. 

The first is that the variance of the difference can also be written 
in the form 

$$\sigma_D^2 = \frac{\sigma_{x_1}^2}{N_1}\left(1 - r_{xy}^2\right) + \frac{\sigma_{x_2}^2}{N_2}\left(1 - r_{xy}^2\right) \qquad (113a)$$

from which we deduce the following important fact: where two 
groups have been separately matched as to distribution on the 
same control variable, the standard error of the difference can be 
obtained without the restriction of the ordinary pairing procedure, 
which requires that there be an equal number of cases in the two 
groups. This holds true, also, when several control variables have 
been used. The reader will note that either term in the above 
formula, (113a), is, as might be expected, identical to formula 
(111) for the sampling error where the stratified method is used. 
Formula (113a) is particularly useful when the cost per case is 
much greater in the experimental group than in the control group. 
Precision can be secured by taking a larger control group. (This 
procedure, taking a larger control group, can obviously be followed 
if the groups are not equated except by randomization.) 
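
The effect of enlarging the control group can be seen in a small calculation. The sketch below assumes that formula (113a) is the sum of two terms, each of the form of (111); the function name and the figures (a standard deviation of 15, a correlation of .60) are hypothetical.

```python
from math import sqrt

def se_difference_matched(var1, n1, var2, n2, r_xy):
    """SE of the difference for groups matched in distribution on a control variable."""
    shrink = 1.0 - r_xy ** 2   # reduction due to the control variable
    return sqrt(var1 / n1 * shrink + var2 / n2 * shrink)

# 50 experimental cases against 50 and then 200 control cases.
equal_groups = se_difference_matched(225.0, 50, 225.0, 50, 0.60)
larger_control = se_difference_matched(225.0, 50, 225.0, 200, 0.60)
print(round(equal_groups, 3), round(larger_control, 3))
```

Quadrupling the control group alone lowers the standard error from 2.4 to about 1.9 in this illustration, without any added cost on the experimental side.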

‡ Wilks, S. S., On the distribution of statistics in samples from a normal 
population of two variables with matched sampling of one variable, Metron, 
1932, 9, 87-126. 

The second significant fact is that, when the N's in the experimental 
and control groups are equal and the distributions on the 
control variables for the two groups have been matched or the 
groups have been equated by pairing on the basis of these same 
control variables, it can be shown algebraically that the correlation 
in formula (26) between pairs on the experimental variable 
is equal to the square of the correlation in formula (113a). Thus, 
formula (26) may be written in the form 

$$\sigma_D^2 = \sigma_{\bar X_1}^2 + \sigma_{\bar X_2}^2 - 2r_{xy}^2\sigma_{\bar X_1}\sigma_{\bar X_2} \qquad (114)$$

where $r_{xy}$ is not the correlation between pairs on the experimental 
variable, but the correlation between the experimental and control 
variable. When the groups have been equated on two or 
more variables, the r needed in formula (114) is the multiple r 
between the experimental and the control variables. 

Formula (114) makes explicit what we have already said, 
namely, that the control variable or variables must be related to 
the experimental variable in order that the equating of groups by 
pairing or matching will result in a statistical, hence an experimental, 
advantage. Furthermore, the efficacy of using additional 
controls is somewhat limited by the well-known fact that the 
increase in the multiple correlation coefficient resulting from 
adding more variables is usually slow. That this phenomenon 
of diminishing returns, associated with the problem of multiple 
correlation, should be operative here may come as a surprise. 

This multiple r must be .866 to diminish the sampling error by 
one-half, and .707 to lead to a reduction in error equivalent to 
that obtained by doubling the size of the samples. It is not our 
purpose to discourage the practice of equating experimental and 
control groups, but the student should realize that such procedures 
do not always lead to any marked advantage over the random 
method. In so far as greater precision can be obtained and selective 
factors avoided, the equating of groups is worth while. It 
must be remembered, however, that, despite the matching of 
pairs on some variables, there are likely to be other variables of 
equal importance upon which the groups will be no more comparable 
than expected on the basis of randomization. 
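
The two values of the multiple r quoted above can be recovered from the factor $(1 - R^2)$ by which matching reduces the sampling variance of the difference. The short derivation below is offered as a sketch; R is used here merely as a label for the multiple correlation between the experimental and the control variables.

```latex
\begin{aligned}
\text{standard error halved:}\quad
  & \sqrt{1 - R^2} = \tfrac{1}{2}
    \;\Rightarrow\; R^2 = .75
    \;\Rightarrow\; R = \sqrt{.75} \approx .866 \\
\text{variance halved (as by doubling } N\text{):}\quad
  & 1 - R^2 = \tfrac{1}{2}
    \;\Rightarrow\; R^2 = .50
    \;\Rightarrow\; R = \sqrt{.50} \approx .707
\end{aligned}
```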

As is well known, one of the most efficient experimental designs 
is the use of the individuals of a group as their own control. The 
performance of a group of individuals is determined for two different 
experimental conditions; and the resulting change, increase 
or decrease, in the behavior is interpreted as being due to 
the differences in conditions, provided such factors as practice 
effects, fatigue, and memory have been taken into account. Such 
a setup does not involve the question of the comparability of two 
groups, but the individuals used must be regarded as a sample of 
some definitive universe, so that the end result must be evaluated 
in terms of sampling in order to have some estimate of the likely 
fluctuation which would occur if the experiment were repeated on 
another sample of the same size. 

Another method of securing comparable groups is to select 
control individuals who are consanguineous to the individuals in 
the experimental group. Examples are the split-litter technique, 
the use of siblings as controls, and the method of co-twin control. 
In so far as the variation in the experimental variable is influenced 
by genetic and environmental factors, the use of identical twins 
represents the best possible method of securing comparable groups 
for experimental purposes. The possible advantage of using twins 
in one field of investigation has been pointed out by "Student" 
in his 1931 paper§ on the Lanarkshire milk experiment in Scotland. 
This investigation involved the daily feeding of three-fourths of a 
pint of raw milk to 5000 children and of an equal amount of 
pasteurized milk to another group of 5000 over a period of four 
months. These 10,000, plus a control group of 10,000, were 
measured for height and weight at the beginning and end of the 
four-month period. Despite large numbers, the groups were not 
comparable as regards initial height and weight, the operating 
selective factor being the benevolent attitude of school teachers 
who apparently thought the research project would not be harmed 
if preference was given frail, undernourished children in choosing 
individuals for the feeder groups. Either a carefully supervised 
random, or a definite pairing, procedure would have, of course, 
avoided this selective bias, but what is more important and more 
relevant to our present topic is "Student's" claim, so far not 
refuted, that the use of 50 pairs of identical twins would have 
yielded as precise information at only 2 per cent of the cost of the 
original experiment, or at a saving of approximately $35,000. 

§ "Student," The Lanarkshire milk experiment, Biometrika, 1931, 23, 398-406. 





EVALUATION OF CHANGES 

Typically, in the experimental study of changes we have an 
initial test, followed by a provided experience and then by a final 
or posttest for an experimental group; and an initial and final 
test without the interpolated experience for a control group. Or 
we may have a pretest-posttest sequence for two groups with 
different interpolated experiences. The statistical evaluation of 
the net changes, i.e., the difference between the two groups, can 
be made by the methods discussed in Chapter 5. It will be recalled 
that we may deal either with the mean changes or with the difference 
between the differences between initial and final means. 
Let 

$$\bar D_E = \bar X_{fE} - \bar X_{iE} = \text{change or difference for experimentals}$$

$$\bar D_C = \bar X_{fC} - \bar X_{iC} = \text{change or difference for controls}$$

Then 

$$D = \bar D_E - \bar D_C = (\bar X_{fE} - \bar X_{iE}) - (\bar X_{fC} - \bar X_{iC})$$

represents the net change, the change shown by the experimentals 
corrected for the change shown by the controls. We may rearrange 
the $\bar X$'s, yet maintain the numerical value of $\bar D_E - \bar D_C$, 
as follows: 

$$D = (\bar X_{fE} - \bar X_{fC}) - (\bar X_{iE} - \bar X_{iC})$$

from which it is seen that the net change may also be thought of 
as the final difference between the two groups corrected for their 
initial difference. Such a correction involves the assumption that 
each unit of difference in initial standing will produce a unit of 
difference in final standing. In other words, this type of adjustment 
implies a one-to-one relationship between initial and final 
scores. Since a perfect correlation is never found or even approached 
in practice, one may question whether the usual procedure 
of comparing changes is really defensible. 
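
The rearrangement is purely arithmetical, as a small calculation shows. In the sketch below the function name and the four group means are hypothetical.

```python
def net_change(final_e, initial_e, final_c, initial_c):
    """Net change from four group means, computed in both arrangements."""
    change_e = final_e - initial_e   # mean change, experimentals
    change_c = final_c - initial_c   # mean change, controls
    as_difference_of_changes = change_e - change_c
    # Final difference between groups, corrected for their initial difference.
    as_corrected_final_difference = (final_e - final_c) - (initial_e - initial_c)
    return as_difference_of_changes, as_corrected_final_difference

# Hypothetical means: experimentals 50 -> 62, controls 48 -> 55.
d1, d2 = net_change(62.0, 50.0, 55.0, 48.0)
print(d1, d2)
```

Both arrangements subtract the whole of the initial difference, which is the one-to-one assumption questioned in the text.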

It is, of course, entirely logical that group differences on the 
final scores, which we may call the experimental variable, should 
be corrected for group differences on initial standing, the uncontrolled 
variable. The covariance adjustment technique of Chapter 15 
is based on predictions by means of regression equations, 
and accordingly it provides a way of correcting final means for 
initial differences, with due allowance for the degree of correlation 
between initial and final scores. Now the ordinary and the 
covariance methods of testing the significance of gains differ not 
only in the correction or adjustment to final means, but also in 
the resultant sampling error. The ordinary technique uses a 
standard error which definitely includes, either explicitly or 
implicitly, the variance for both initial and final scores and the 
correlation of initial with final, whereas the error term used in the 
covariance method is a direct function of the degree of correlation 
and of the variance of the final scores only. In other words, the 
net differences being tested are not the same, and neither are the 
error terms the same. In general the two methods will not lead 
to the same level of significance for a given comparison. 

Which method is preferable? The student who is interested in 
an answer to this question will wish to read Chapter IX of 
R. A. Fisher's Design of experiments.‖ Suffice it to say here that 
Professor Fisher discusses different types of corrections and then 
proceeds to use the covariance technique. 

RELATIONSHIP OF F, t, χ², AND NORMAL DISTRIBUTIONS 

In Chapter 11 it was shown that a χ² with one degree of freedom 
is the same as a $(D/\sigma_D)^2$, or $(CR)^2$. In Chapter 12 it was seen 
that for df's larger than 30 the value of t can be interpreted as a 
normal curve deviate; as df becomes larger and larger, the error 
involved in treating t as a CR becomes less and less. In Chapter 13 
we had several instances for which F is the same as t² when 
$n_1 = 1$. It would therefore seem that there is a systematic 
connection between the four distributions which are used in testing 
significance. 

It can be shown that the normal, the χ², and the t distributions 
are special cases of the F distribution; i.e., the three can be 
deduced from the equation of F. When we take $n_1 = 1$ and $F = t^2$, 
the equation of F can be transformed to that of t with df or $n = n_2$. 
If we set $n_1 = 1$, $n_2 = \infty$, and $F = (x/\sigma)^2$, the equation for the 
normal curve will, with proper manipulations, emerge. If we set 
$n_2 = \infty$, allow $n_1$ to vary, and let $F = \chi^2/n_1$, we can get the χ² 
distribution with df or $n = n_1$. Thus $\chi^2 = n_1F = nF$ where n 
is the df for χ². 

‖ Fisher, R. A., Design of experiments, London: Oliver and Boyd, 1942. 

Since F is the general distribution, it is possible if necessary 
for one to get along with just the table for F. If we desire to know 
whether t reaches the .01 level, we need only enter t² under 
$n_1 = 1$, opposite the $n_2$ corresponding to the df for the t; if we 
wish to know whether a CR reaches the .001 level, we square CR 
and look under $n_1 = 1$ and $n_2 = \infty$; if we need to judge the 
significance of a χ² based on n degrees of freedom, we can enter the 
F table with $F = \chi^2/n$ opposite $n_2 = \infty$ and under $n_1 = n$. 
Although F contains the other three distributions and although 
as regards tests of significance the normal curve no longer occupies 
the conspicuous place it once did, there remains the fact that 
underlying F, t, and χ² we have an assumption of normality. For 
both F and t it is assumed that the universe distribution for the 
variable being sampled is normal, and for χ² it is assumed that 
sampling frequencies form a normal distribution about the expected 
frequency. It cannot be argued that any one of these 
distributions is more important than the others. Each has a 
proper place in statistical inference. 
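
One of these connections, that between χ² with one degree of freedom and the squared normal deviate, can be verified with nothing more than the error function. The sketch below uses only the Python standard library; the function names are hypothetical, and the closed form for χ² is the standard one that holds for even degrees of freedom only.

```python
from math import erfc, exp, sqrt

def normal_two_tailed_p(cr):
    """Probability of a normal deviate exceeding |CR| in either direction."""
    return erfc(abs(cr) / sqrt(2.0))

def chi2_upper_p_df1(chi2):
    """Upper-tail probability of chi-square with 1 df (a squared normal deviate)."""
    return erfc(sqrt(chi2 / 2.0))

def chi2_upper_p_even_df(chi2, df):
    """Upper-tail probability of chi-square for even df (closed-form series)."""
    if df < 2 or df % 2 != 0:
        raise ValueError("closed form used here requires even df >= 2")
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (chi2/2)^k / k!
        total += term
    return exp(-half) * total

print(normal_two_tailed_p(1.96), chi2_upper_p_df1(1.96 ** 2))
print(chi2_upper_p_df1(3.84), chi2_upper_p_even_df(5.99, 2), chi2_upper_p_even_df(9.49, 4))
```

The first line prints the same probability twice, since a CR of 1.96 and a χ² of 1.96² with one degree of freedom are the same test; the second line shows that the .05 entries of the χ² table for 1, 2, and 4 degrees of freedom each give a probability close to .05.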



APPENDIX 

Tables A to F 





Table A. Normal Curve Functions 

z or x/σ    Area: m to z    Area: q (smaller)    y or ordinate

   .00         .00000            .50000              .3989
   .05         .01994            .48006              .3984
   .10         .03983            .46017              .3970
   .15         .05962            .44038              .3945
   .20         .07926            .42074              .3910
   .25         .09871            .40129              .3867
   .30         .11791            .38209              .3814
   .35         .13683            .36317              .3752
   .40         .15542            .34458              .3683
   .45         .17364            .32636              .3605
   .50         .19146            .30854              .3521
   .55         .20884            .29116              .3429
   .60         .22575            .27425              .3332
   .65         .24215            .25785              .3230
   .70         .25804            .24196              .3123
   .75         .27337            .22663              .3011
   .80         .28814            .21186              .2897
   .85         .30234            .19766              .2780
   .90         .31594            .18406              .2661
   .95         .32894            .17106              .2541
  1.00         .34134            .15866              .2420
  1.05         .35314            .14686              .2299
  1.10         .36433            .13567              .2179
  1.15         .37493            .12507              .2059
  1.20         .38493            .11507              .1942
  1.25         .39435            .10565              .1826
  1.30         .40320            .09680              .1714
  1.35         .41149            .08851              .1604
  1.40         .41924            .08076              .1497
  1.45         .42647            .07353              .1394
  1.50         .43319            .06681              .1295
  1.55         .43943            .06057              .1200
  1.60         .44520            .05480              .1109
  1.65         .45053            .04947              .1023
  1.70         .45543            .04457              .0940
  1.75         .45994            .04006              .0863
  1.80         .46407            .03593              .0790
  1.85         .46784            .03216              .0721
  1.90         .47128            .02872              .0656
  1.95         .47441            .02559              .0596
  2.00         .47725            .02275              .0540
  2.05         .47982            .02018              .0488
  2.10         .48214            .01786              .0440
  2.15         .48422            .01578              .0396
  2.20         .48610            .01390              .0355
  2.25         .48778            .01222              .0317
  2.30         .48928            .01072              .0283
  2.35         .49061            .00939              .0252
  2.40         .49180            .00820              .0224
  2.45         .49286            .00714              .0198
  2.50         .49379            .00621              .0175
  2.55         .49461            .00539              .0154
  2.60         .49534            .00466              .0136
  2.65         .49598            .00402              .0119
  2.70         .49653            .00347              .0104
  2.75         .49702            .00298              .0091
  2.80         .49744            .00256              .0079
  2.85         .49781            .00219              .0069
  2.90         .49813            .00187              .0060
  2.95         .49841            .00159              .0051
  3.00         .49865            .00135              .0044
  3.25         .49942            .00058              .0020
  3.50         .49977            .00023              .0009
  3.75         .49991            .00009              .0004
  4.00         .49997            .00003              .0001
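
Entries of Table A for values of z not tabled can be computed directly. The sketch below assumes that "Area: m to z" denotes the area between the mean and z and that the ordinate is that of the unit normal curve; the function name is hypothetical.

```python
from math import erf, exp, pi, sqrt

def normal_curve_functions(z):
    """Return (area from mean to z, smaller area beyond z, ordinate at z)."""
    area_mean_to_z = 0.5 * erf(z / sqrt(2.0))
    smaller_area = 0.5 - area_mean_to_z
    ordinate = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    return area_mean_to_z, smaller_area, ordinate

for z in (1.00, 1.96, 2.00):
    area, q, y = normal_curve_functions(z)
    print(f"{z:.2f}  {area:.5f}  {q:.5f}  {y:.4f}")
```

The two areas always sum to .50000, which also serves as a check on any row of the printed table.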



Table B. Transformation of r to z 

   r      z         r      z         r       z

  .01   .010       .34   .354       .67    .811
  .02   .020       .35   .366       .68    .829
  .03   .030       .36   .377       .69    .848
  .04   .040       .37   .389       .70    .867
  .05   .050       .38   .400       .71    .887
  .06   .060       .39   .412       .72    .908
  .07   .070       .40   .424       .73    .929
  .08   .080       .41   .436       .74    .950
  .09   .090       .42   .448       .75    .973
  .10   .100       .43   .460       .76    .996
  .11   .110       .44   .472       .77   1.020
  .12   .121       .45   .485       .78   1.045
  .13   .131       .46   .497       .79   1.071
  .14   .141       .47   .510       .80   1.099
  .15   .151       .48   .523       .81   1.127
  .16   .161       .49   .536       .82   1.157
  .17   .172       .50   .549       .83   1.188
  .18   .181       .51   .563       .84   1.221
  .19   .192       .52   .577       .85   1.256
  .20   .203       .53   .590       .86   1.293
  .21   .214       .54   .604       .87   1.333
  .22   .224       .55   .618       .88   1.376
  .23   .234       .56   .633       .89   1.422
  .24   .245       .57   .648       .90   1.472
  .25   .256       .58   .663       .91   1.528
  .26   .266       .59   .678       .92   1.589
  .27   .277       .60   .693       .93   1.658
  .28   .288       .61   .709       .94   1.738
  .29   .299       .62   .725       .95   1.832
  .30   .309       .63   .741       .96   1.946
  .31   .321       .64   .758       .97   2.092
  .32   .332       .65   .775       .98   2.298
  .33   .343       .66   .793       .99   2.647



Table C. Transformation of z to r* 

   z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

   .0   .0000  .0100  .0200  .0300  .0400  .0500  .0599  .0699  .0798  .0898
   .1   .0997  .1096  .1194  .1293  .1391  .1489  .1586  .1684  .1781  .1877
   .2   .1974  .2070  .2165  .2260  .2355  .2449  .2543  .2636  .2729  .2821
   .3   .2913  .3004  .3095  .3185  .3275  .3364  .3452  .3540  .3627  .3714
   .4   .3800  .3885  .3969  .4053  .4136  .4219  .4301  .4382  .4462  .4542
   .5   .4621  .4699  .4777  .4854  .4930  .5005  .5080  .5154  .5227  .5299
   .6   .5370  .5441  .5511  .5580  .5649  .5717  .5784  .5850  .5915  .5980
   .7   .6044  .6107  .6169  .6231  .6291  .6351  .6411  .6469  .6527  .6584
   .8   .6640  .6696  .6751  .6805  .6858  .6911  .6963  .7014  .7064  .7114
   .9   .7163  .7211  .7259  .7306  .7352  .7398  .7443  .7487  .7531  .7574
  1.0   .7616  .7658  .7699  .7739  .7779  .7818  .7857  .7895  .7932  .7969
  1.1   .8005  .8041  .8076  .8110  .8144  .8178  .8210  .8243  .8275  .8306
  1.2   .8337  .8367  .8397  .8426  .8455  .8483  .8511  .8538  .8565  .8591
  1.3   .8617  .8643  .8668  .8692  .8717  .8741  .8764  .8787  .8810  .8832
  1.4   .8854  .8875  .8896  .8917  .8937  .8957  .8977  .8996  .9015  .9033
  1.5   .9051  .9069  .9087  .9104  .9121  .9138  .9154  .9170  .9186  .9201
  1.6   .9217  .9232  .9246  .9261  .9275  .9289  .9302  .9316  .9329  .9341
  1.7   .9354  .9366  .9379  .9391  .9402  .9414  .9425  .9436  .9447  .9458
  1.8   .9468  .9478  .9488  .9498  .9508  .9518  .9527  .9536  .9545  .9554
  1.9   .9562  .9571  .9579  .9587  .9595  .9603  .9611  .9618  .9626  .9633
  2.0   .9640  .9647  .9654  .9661  .9668  .9674  .9680  .9686  .9693  .9699
  2.1   .9704  .9710  .9716  .9722  .9727  .9732  .9738  .9743  .9748  .9753
  2.2   .9757  .9762  .9767  .9771  .9776  .9780  .9785  .9789  .9793  .9797
  2.3   .9801  .9805  .9809  .9812  .9816  .9820  .9823  .9827  .9830  .9834
  2.4   .9837  .9840  .9843  .9846  .9849  .9852  .9855  .9858  .9861  .9864
  2.5   .9866  .9869  .9871  .9874  .9876  .9879  .9881  .9884  .9886  .9888
  2.6   .9890  .9892  .9894  .9897  .9899  .9901  .9903  .9904  .9906  .9908
  2.7   .9910  .9912  .9914  .9915  .9917  .9919  .9920  .9922  .9923  .9925
  2.8   .9926  .9928  .9929  .9931  .9932  .9933  .9935  .9936  .9937  .9938
  2.9   .9940  .9941  .9942  .9943  .9944  .9945  .9946  .9948  .9948  .9950

* Table C is abridged from Table VII of Fisher and Yates: Statistical tables 
for biological, agricultural and medical research, Oliver and Boyd, Ltd., 
Edinburgh, by permission of the authors and publishers. 
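
Tables B and C tabulate a transformation and its inverse, so that values falling between the tabled entries can be computed when needed. The sketch below assumes that z is Fisher's transformation, z = ½ log_e[(1 + r)/(1 - r)], which is the inverse hyperbolic tangent of r; the function names are hypothetical.

```python
from math import atanh, tanh

def r_to_z(r):
    """Transformation of r to z, as tabulated in Table B."""
    return atanh(r)

def z_to_r(z):
    """Transformation of z back to r, as tabulated in Table C."""
    return tanh(z)

for r in (0.50, 0.90):
    print(r, round(r_to_z(r), 3))
for z in (1.00, 2.00):
    print(z, round(z_to_r(z), 4))
```

Computed values may occasionally differ from a printed entry by a unit in the last place.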



Table D. Distribution of χ²* 

   n   P = .99     .98     .95     .90     .80     .70     .50

   1    .00016  .00063   .0039    .016    .064     .15     .46
   2       .02     .04     .10     .21     .45     .71    1.39
   3       .12     .18     .35     .58    1.00    1.42    2.37
   4       .30     .43     .71    1.06    1.65    2.20    3.36
   5       .55     .75    1.14    1.61    2.34    3.00    4.35
   6       .87    1.13    1.64    2.20    3.07    3.83    5.35
   7      1.24    1.56    2.17    2.83    3.82    4.67    6.35
   8      1.65    2.03    2.73    3.49    4.59    5.53    7.34
   9      2.09    2.53    3.32    4.17    5.38    6.39    8.34
  10      2.56    3.06    3.94    4.86    6.18    7.27    9.34
  11      3.05    3.61    4.58    5.58    6.99    8.15   10.34
  12      3.57    4.18    5.23    6.30    7.81    9.03   11.34
  13      4.11    4.76    5.89    7.04    8.63    9.93   12.34
  14      4.66    5.37    6.57    7.79    9.47   10.82   13.34
  15      5.23    5.98    7.26    8.55   10.31   11.72   14.34
  16      5.81    6.61    7.96    9.31   11.15   12.62   15.34
  17      6.41    7.26    8.67   10.08   12.00   13.53   16.34
  18      7.02    7.91    9.39   10.86   12.86   14.44   17.34
  19      7.63    8.57   10.12   11.65   13.72   15.35   18.34
  20      8.26    9.24   10.85   12.44   14.58   16.27   19.34
  21      8.90    9.92   11.59   13.24   15.44   17.18   20.34
  22      9.54   10.60   12.34   14.04   16.31   18.10   21.34
  23     10.20   11.29   13.09   14.85   17.19   19.02   22.34
  24     10.86   11.99   13.85   15.66   18.06   19.94   23.34
  25     11.52   12.70   14.61   16.47   18.94   20.87   24.34
  26     12.20   13.41   15.38   17.29   19.82   21.79   25.34
  27     12.88   14.12   16.15   18.11   20.70   22.72   26.34
  28     13.56   14.85   16.93   18.94   21.59   23.65   27.34
  29     14.26   15.57   17.71   19.77   22.48   24.58   28.34
  30     14.95   16.31   18.49   20.60   23.36   25.51   29.34

* Table D is abridged from Table IV of Fisher and Yates: Statistical tables 
for biological, agricultural and medical research, Oliver and Boyd, Ltd., 
Edinburgh, by permission of the authors and publishers. 


COCOCO COCOCOCOOT OTCOCOOTOT COCOCOCOCO COCOCOCOCO 



ggsisss sssijs; Sooomo* <^^o>to 


Appendix 


351 


Table D. Distribution of χ²* (Continued)

  n P = .30     .20     .10     .05     .02     .01    .001
  1    1.07    1.64    2.71    3.84    5.41    6.64   10.83
  2    2.41    3.22    4.60    5.99    7.82    9.21   13.82
  3    3.66    4.64    6.25    7.82    9.84   11.34   16.27
  4    4.88    5.99    7.78    9.49   11.67   13.28   18.46
  5    6.06    7.29    9.24   11.07   13.39   15.09   20.52
  6    7.23    8.56   10.64   12.59   15.03   16.81   22.46
  7    8.38    9.80   12.02   14.07   16.62   18.48   24.32
  8    9.52   11.03   13.36   15.51   18.17   20.09   26.12
  9   10.66   12.24   14.68   16.92   19.68   21.67   27.88
 10   11.78   13.44   15.99   18.31   21.16   23.21   29.59
 11   12.90   14.63   17.28   19.68   22.62   24.72   31.26
 12   14.01   15.81   18.55   21.03   24.05   26.22   32.91
 13   15.12   16.98   19.81   22.36   25.47   27.69   34.53
 14   16.22   18.15   21.06   23.68   26.87   29.14   36.12
 15   17.32   19.31   22.31   25.00   28.26   30.58   37.70
 16   18.42   20.46   23.54   26.30   29.63   32.00   39.25
 17   19.51   21.62   24.77   27.59   31.00   33.41   40.79
 18   20.60   22.76   25.99   28.87   32.35   34.80   42.31
 19   21.69   23.90   27.20   30.14   33.69   36.19   43.82
 20   22.78   25.04   28.41   31.41   35.02   37.57   45.32
 21   23.86   26.17   29.62   32.67   36.34   38.93   46.80
 22   24.94   27.30   30.81   33.92   37.66   40.29   48.27
 23   26.02   28.43   32.01   35.17   38.97   41.64   49.73
 24   27.10   29.55   33.20   36.42   40.27   42.98   51.18
 25   28.17   30.68   34.38   37.65   41.57   44.31   52.62
 26   29.25   31.80   35.56   38.88   42.86   45.64   54.05
 27   30.32   32.91   36.74   40.11   44.14   46.96   55.48
 28   31.39   34.03   37.92   41.34   45.42   48.28   56.89
 29   32.46   35.14   39.09   42.56   46.69   49.59   58.30
 30   33.53   36.25   40.26   43.77   47.96   50.89   59.70

* Table D is abridged from Table IV of Fisher and Yates: Statistical tables
for biological, agricultural and medical research, Oliver and Boyd, Ltd.,
Edinburgh, by permission of the authors and publishers.
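
Entries in Table D can be spot-checked by computing the tail probability directly. The following is a minimal sketch, not part of the original text: it assumes that the tabled P is the probability of obtaining a chi square as large as or larger than the entry, for n degrees of freedom, and the function name `chi2_upper_tail` is hypothetical. It uses only the Python standard library.

```python
import math


def chi2_upper_tail(x: float, n: int) -> float:
    """Probability that chi square with n degrees of freedom is >= x (n >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if n % 2 == 0:
        # Start from n = 2, where the tail probability is exp(-x/2).
        prob, a = math.exp(-half), 1.0
    else:
        # Start from n = 1, which is a two-tailed normal probability.
        prob, a = math.erfc(math.sqrt(half)), 0.5
    # Raise the degrees of freedom two at a time:
    # Q(n + 2) = Q(n) + (x/2)^(n/2) * exp(-x/2) / Gamma(n/2 + 1)
    while a < n / 2.0:
        prob += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return prob


# Tabled .05 entries for n = 1, 2, and 10 should give back roughly .05.
for chi_square, df in [(3.84, 1), (5.99, 2), (18.31, 10)]:
    print(df, round(chi2_upper_tail(chi_square, df), 3))
```

Because the tabled values are rounded to two decimals, the recovered probabilities agree with the column headings only approximately.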



352 


Appendix 


Table E. Distribution of t*

  n    P = .1       .05       .02       .01      .001
  1     6.314    12.706    31.821    63.657   636.619
  2     2.920     4.303     6.965     9.925    31.598
  3     2.353     3.182     4.541     5.841    12.941
  4     2.132     2.776     3.747     4.604     8.610
  5     2.015     2.571     3.365     4.032     6.859
  6     1.943     2.447     3.143     3.707     5.959
  7     1.895     2.365     2.998     3.499     5.405
  8     1.860     2.306     2.896     3.355     5.041
  9     1.833     2.262     2.821     3.250     4.781
 10     1.812     2.228     2.764     3.169     4.587
 11     1.796     2.201     2.718     3.106     4.437
 12     1.782     2.179     2.681     3.055     4.318
 13     1.771     2.160     2.650     3.012     4.221
 14     1.761     2.145     2.624     2.977     4.140
 15     1.753     2.131     2.602     2.947     4.073
 16     1.746     2.120     2.583     2.921     4.015
 17     1.740     2.110     2.567     2.898     3.965
 18     1.734     2.101     2.552     2.878     3.922
 19     1.729     2.093     2.539     2.861     3.883
 20     1.725     2.086     2.528     2.845     3.850
 21     1.721     2.080     2.518     2.831     3.819
 22     1.717     2.074     2.508     2.819     3.792
 23     1.714     2.069     2.500     2.807     3.767
 24     1.711     2.064     2.492     2.797     3.745
 25     1.708     2.060     2.485     2.787     3.725
 26     1.706     2.056     2.479     2.779     3.707
 27     1.703     2.052     2.473     2.771     3.690
 28     1.701     2.048     2.467     2.763     3.674
 29     1.699     2.045     2.462     2.756     3.659
 30     1.697     2.042     2.457     2.750     3.646
 40     1.684     2.021     2.423     2.704     3.551
 60     1.671     2.000     2.390     2.660     3.460
120     1.658     1.980     2.358     2.617     3.373
  ∞     1.645     1.960     2.326     2.576     3.291


* Table E is abridged from Table III of Fisher and Yates: Statistical tables
for biological, agricultural and medical research, Oliver and Boyd, Ltd.,
Edinburgh, by permission of the authors and publishers.
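
The last row of Table E (n = ∞) can be reproduced from the unit normal curve. The sketch below is an illustration added here, not the author's procedure; it assumes that P in Table E is the probability of a deviation as large as the tabled t in either direction (both tails), and the helper name `two_tailed_normal_deviate` is hypothetical.

```python
from statistics import NormalDist


def two_tailed_normal_deviate(p: float) -> float:
    """Unit-normal deviate exceeded in absolute value with probability p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


levels = [0.1, 0.05, 0.02, 0.01, 0.001]
infinity_row = [round(two_tailed_normal_deviate(p), 3) for p in levels]
print(infinity_row)  # [1.645, 1.96, 2.326, 2.576, 3.291]
```

Agreement with the tabled row supports reading the P values as two-tailed probabilities.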



Appendix 


353 


Table F. Table of F for .05 (roman), .01 (italic), and .001 (bold face)
Levels of Significance *

  n2     P  n1 = 1       2       3       4       5       6       8      12      24       ∞

   1   .05     161     200     216     225     230     234     239     244     249     254
       .01    4052    4999    5403    5625    5764    5859    5981    6106    6234    6366
      .001  405284  500000  540379  562500  576405  585937  598144  610667  623497  636619

   2   .05   18.51   19.00   19.16   19.25   19.30   19.33   19.37   19.41   19.45   19.50
       .01   98.49   99.01   99.17   99.25   99.30   99.33   99.36   99.42   99.46   99.50
      .001   998.5   999.0   999.2   999.2   999.3   999.3   999.4   999.4   999.5   999.5

   3   .05   10.13    9.55    9.28    9.12    9.01    8.94    8.84    8.74    8.64    8.53
       .01   34.12   30.81   29.46   28.71   28.24   27.91   27.49   27.05   26.60   26.12
      .001   167.5   148.5   141.1   137.1   134.6   132.8   130.6   128.3   125.9   123.5

   4   .05    7.71    6.94    6.59    6.39    6.26    6.16    6.04    5.91    5.77    5.63
       .01   21.20   18.00   16.69   15.98   15.52   15.21   14.80   14.37   13.93   13.46
      .001   74.14   61.25   56.18   53.44   51.71   50.53   49.00   47.41   45.77   44.05

   5   .05    6.61    5.79    5.41    5.19    5.05    4.95    4.82    4.68    4.53    4.36
       .01   16.26   13.27   12.06   11.39   10.97   10.67   10.27    9.89    9.47    9.02
      .001   47.04   36.61   33.20   31.09   29.75   28.84   27.64   26.42   25.14   23.78

   6   .05    5.99    5.14    4.76    4.53    4.39    4.28    4.15    4.00    3.84    3.67
       .01   13.74   10.92    9.78    9.15    8.75    8.47    8.10    7.72    7.31    6.88
      .001   35.51   27.00   23.70   21.90   20.81   20.03   19.03   17.99   16.89   15.75

   7   .05    5.59    4.74    4.35    4.12    3.97    3.87    3.73    3.57    3.41    3.23
       .01   12.25    9.55    8.45    7.85    7.46    7.19    6.84    6.47    6.07    5.65
      .001   29.22   21.69   18.77   17.19   16.21   15.52   14.63   13.71   12.73   11.69

   8   .05    5.32    4.46    4.07    3.84    3.69    3.58    3.44    3.28    3.12    2.93
       .01   11.26    8.65    7.59    7.01    6.63    6.37    6.03    5.67    5.28    4.86
      .001   25.42   18.49   15.83   14.39   13.49   12.86   12.04   11.19   10.30    9.34

   9   .05    5.12    4.26    3.86    3.63    3.48    3.37    3.23    3.07    2.90    2.71
       .01   10.56    8.02    6.99    6.42    6.06    5.80    5.47    5.11    4.73    4.31
      .001   22.86   16.39   13.90   12.56   11.71   11.13   10.37    9.57    8.72    7.81

  10   .05    4.96    4.10    3.71    3.48    3.33    3.22    3.07    2.91    2.74    2.54
       .01   10.04    7.56    6.55    5.99    5.64    5.39    5.06    4.71    4.33    3.91
      .001   21.04   14.91   12.55   11.28   10.48    9.92    9.20    8.45    7.64    6.76

  11   .05    4.84    3.98    3.59    3.36    3.20    3.09    2.95    2.79    2.61    2.40
       .01    9.65    7.20    6.22    5.67    5.32    5.07    4.74    4.40    4.02    3.60
      .001   19.69   13.81   11.56   10.35    9.58    9.05    8.35    7.63    6.85    6.00

  12   .05    4.75    3.88    3.49    3.26    3.11    3.00    2.85    2.69    2.50    2.30
       .01    9.33    6.93    5.95    5.41    5.06    4.82    4.50    4.16    3.78    3.36
      .001   18.64   12.97   10.80    9.63    8.89    8.38    7.71    7.00    6.25    5.42


* Table F is reprinted, in rearranged form, from Table V of Fisher and Yates: Statistical
tables for biological, agricultural and medical research, Oliver and Boyd, Ltd., Edinburgh, by 
permission of the authors and publishers. 


354 


Appendix 


Table F. Table of F for .05 (roman), .01 (italic), and .001 (bold face)
Levels of Significance * (Continued)

  n2     P  n1 = 1       2       3       4       5       6       8      12      24       ∞

  13   .05    4.67    3.80    3.41    3.18    3.02    2.92    2.77    2.60    2.42    2.21
       .01    9.07    6.70    5.74    5.20    4.86    4.62    4.30    3.96    3.59    3.16
      .001   17.81   12.31   10.21    9.07    8.35    7.86    7.21    6.52    5.78    4.97

  14   .05    4.60    3.74    3.34    3.11    2.96    2.85    2.70    2.53    2.35    2.13
       .01    8.86    6.51    5.56    5.03    4.69    4.46    4.14    3.80    3.43    3.00
      .001   17.14   11.78    9.73    8.62    7.92    7.43    6.80    6.13    5.41    4.60

  15   .05    4.54    3.68    3.29    3.06    2.90    2.79    2.64    2.48    2.29    2.07
       .01    8.68    6.36    5.42    4.89    4.56    4.32    4.00    3.67    3.29    2.87
      .001   16.59   11.34    9.34    8.25    7.57    7.09    6.47    5.81    5.10    4.31

  16   .05    4.49    3.63    3.24    3.01    2.85    2.74    2.59    2.42    2.24    2.01
       .01    8.53    6.23    5.29    4.77    4.44    4.20    3.89    3.55    3.18    2.75
      .001   16.12   10.97    9.00    7.94    7.27    6.81    6.19    5.55    4.85    4.06

  17   .05    4.45    3.59    3.20    2.96    2.81    2.70    2.55    2.38    2.19    1.96
       .01    8.40    6.11    5.18    4.67    4.34    4.10    3.79    3.46    3.08    2.65
      .001   15.72   10.66    8.73    7.68    7.02    6.56    5.96    5.32    4.63    3.85

  18   .05    4.41    3.55    3.16    2.93    2.77    2.66    2.51    2.34    2.15    1.92
       .01    8.28    6.01    5.09    4.58    4.25    4.01    3.71    3.37    3.00    2.57
      .001   15.38   10.39    8.49    7.46    6.81    6.35    5.76    5.13    4.45    3.67

  19   .05    4.38    3.52    3.13    2.90    2.74    2.63    2.48    2.31    2.11    1.88
       .01    8.18    5.93    5.01    4.50    4.17    3.94    3.63    3.30    2.92    2.49
      .001   15.08   10.16    8.28    7.26    6.61    6.18    5.59    4.97    4.29    3.52

  20   .05    4.35    3.49    3.10    2.87    2.71    2.60    2.45    2.28    2.08    1.84
       .01    8.10    5.85    4.94    4.43    4.10    3.87    3.56    3.23    2.86    2.42
      .001   14.82    9.95    8.10    7.10    6.46    6.02    5.44    4.82    4.15    3.38

  21   .05    4.32    3.47    3.07    2.84    2.68    2.57    2.42    2.25    2.05    1.81
       .01    8.02    5.78    4.87    4.37    4.04    3.81    3.51    3.17    2.80    2.36
      .001   14.59    9.77    7.94    6.95    6.32    5.88    5.31    4.70    4.03    3.26

  22   .05    4.30    3.44    3.05    2.82    2.66    2.55    2.40    2.23    2.03    1.78
       .01    7.94    5.72    4.82    4.31    3.99    3.76    3.45    3.12    2.75    2.31
      .001   14.38    9.61    7.80    6.81    6.19    5.76    5.19    4.58    3.92    3.15

  23   .05    4.28    3.42    3.03    2.80    2.64    2.53    2.38    2.20    2.00    1.76
       .01    7.88    5.66    4.76    4.26    3.94    3.71    3.41    3.07    2.70    2.26
      .001   14.19    9.47    7.67    6.69    6.08    5.65    5.09    4.48    3.82    3.05

  24   .05    4.26    3.40    3.01    2.78    2.62    2.51    2.36    2.18    1.98    1.73
       .01    7.82    5.61    4.72    4.22    3.90    3.67    3.36    3.03    2.66    2.21
      .001   14.03    9.34    7.55    6.59    5.98    5.55    4.99    4.39    3.74    2.97


* Table F is reprinted, in rearranged form, from Table V of Fisher and Yates: Statistical
tables for biological, agricultural and medical research, Oliver and Boyd, Ltd., Edinburgh,
by permission of the authors and publishers.





Appendix 


355 


Table F. Table of F for .05 (roman), .01 (italic), and .001 (bold face)
Levels of Significance * (Continued)

  n2     P  n1 = 1       2       3       4       5       6       8      12      24       ∞

  25   .05    4.24    3.38    2.99    2.76    2.60    2.49    2.34    2.16    1.96    1.71
       .01    7.77    5.57    4.68    4.18    3.86    3.63    3.32    2.99    2.62    2.17
      .001   13.88    9.22    7.45    6.49    5.88    5.46    4.91    4.31    3.66    2.89

  26   .05    4.22    3.37    2.98    2.74    2.59    2.47    2.32    2.15    1.95    1.69
       .01    7.72    5.53    4.64    4.14    3.82    3.59    3.29    2.96    2.58    2.13
      .001   13.74    9.12    7.36    6.41    5.80    5.38    4.83    4.24    3.59    2.82

  27   .05    4.21    3.35    2.96    2.73    2.57    2.46    2.30    2.13    1.93    1.67
       .01    7.68    5.49    4.60    4.11    3.78    3.56    3.26    2.93    2.55    2.10
      .001   13.61    9.02    7.27    6.33    5.73    5.31    4.76    4.17    3.52    2.75

  28   .05    4.20    3.34    2.95    2.71    2.56    2.44    2.29    2.12    1.91    1.65
       .01    7.64    5.45    4.57    4.07    3.75    3.53    3.23    2.90    2.52    2.06
      .001   13.50    8.93    7.19    6.25    5.66    5.24    4.69    4.11    3.46    2.70

  29   .05    4.18    3.33    2.93    2.70    2.54    2.43    2.28    2.10    1.90    1.64
       .01    7.60    5.42    4.54    4.04    3.73    3.50    3.20    2.87    2.49    2.03
      .001   13.39    8.85    7.12    6.19    5.59    5.18    4.64    4.05    3.41    2.64

  30   .05    4.17    3.32    2.92    2.69    2.53    2.42    2.27    2.09    1.89    1.62
       .01    7.56    5.39    4.51    4.02    3.70    3.47    3.17    2.84    2.47    2.01
      .001   13.29    8.77    7.05    6.12    5.53    5.12    4.58    4.00    3.36    2.59

  40   .05    4.08    3.23    2.84    2.61    2.45    2.34    2.18    2.00    1.79    1.51
       .01    7.31    5.18    4.31    3.83    3.51    3.29    2.99    2.66    2.29    1.80
      .001   12.61    8.25    6.60    5.70    5.13    4.73    4.21    3.64    3.01    2.23

  60   .05    4.00    3.15    2.76    2.52    2.37    2.25    2.10    1.92    1.70    1.39
       .01    7.08    4.98    4.13    3.65    3.34    3.12    2.82    2.50    2.12    1.60
      .001   11.97    7.76    6.17    5.31    4.76    4.37    3.87    3.31    2.69    1.90

 120   .05    3.92    3.07    2.68    2.45    2.29    2.17    2.02    1.83    1.61    1.25
       .01    6.85    4.79    3.95    3.48    3.17    2.96    2.66    2.34    1.95    1.38
      .001   11.38    7.31    5.79    4.95    4.42    4.04    3.55    3.02    2.40    1.56

   ∞   .05    3.84    2.99    2.60    2.37    2.21    2.09    1.94    1.75    1.52    1.00
       .01    6.64    4.60    3.78    3.32    3.02    2.80    2.51    2.18    1.79    1.00
      .001   10.83    6.91    5.42    4.62    4.10    3.74    3.27    2.74    2.13    1.00


* Table F is reprinted, in rearranged form, from Table V of Fisher and Yates: Statistical 
tables for biological, agricultural and medical research, Oliver and Boyd, Ltd., Edinburgh, 
by permission of the authors and publishers. 
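
Tables E and F can be checked against each other. When n1 = 1, F is the square of t for the same degrees of freedom and level of significance, so the first column of Table F should equal the squared entries of Table E. The sketch below, added here as an illustration under that assumption, uses the entries for 10 degrees of freedom; the helper name `squared_t_matches_f` is hypothetical.

```python
# Entries copied from Table E (t) and from Table F (n1 = 1, n2 = 10).
t_values = {0.05: 2.228, 0.01: 3.169, 0.001: 4.587}
f_values = {0.05: 4.96, 0.01: 10.04, 0.001: 21.04}


def squared_t_matches_f(t: float, f: float, tol: float = 0.01) -> bool:
    """True when t squared agrees with F to within rounding of the tabled values."""
    return abs(t * t - f) < tol


checks = {p: squared_t_matches_f(t_values[p], f_values[p]) for p in t_values}
print(checks)  # {0.05: True, 0.01: True, 0.001: True}
```

The tolerance of 0.01 reflects the two-decimal rounding of the F entries.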




Index 


Accidental sampling, 332 
Alienation, coefficient of, 112 
Analysis of variance, 235-330 
applications for significance: 
of correlation, linear, 251-255, 
259-262 

of correlation ratio, 249-251, 
259-262 
of differences: 

for correlated means, 274-276,
295-299 

for independent means, 240- 
242, 243-245 
of interaction, 287-289 
of multiple correlation, 262-266 
of nonlinearity, 255-258, 259- 
262 

of reliability, 276-280 
assumptions: 

normality for trait, 235, 249 
independent variance estimates, 
235, 239-240, 324 
similar variances for groups, 240, 
249, 262 
classifications: 
double, 270-274 
higher, 316 
simple, 235-236, 267 
triple, 289-295 
computation: 

double classification, 280-282, 
284-287 

groups of unequal size, 248 
simple classification, 242-244 
single group, 222 
triple classification, 299-305 
covariance method, 318-330 
computation, 325-328 
and correlation, 320-322 
degrees of freedom, 320, 325 


Analysis of variance, covariance 
method (Continued) 
regression adjustments, 322-325, 

328 

situations for use, 318-319, 328, 

329 

sums of products, 320 
degrees of freedom, 239, 252, 256, 
263, 272, 284, 294, 320, 325 
error term for F, 279, 288-289 
generalizations from, 311-316 
interaction, 269, 283, 311 
higher, 316-317 
illustrations of, 309-311 
simple, 283-284, 288-289 
triple, 293-294 

significant F, meaning of, 240-242 
sum of squares, 222 
between-groups, 238 
interaction, 283, 293-294 
remainder, 272-273, 283 
as error, 278 

residual, 252, 273, 293, 323 
within-groups, 238 
variance estimates, 219 
between-groups, 238, 240 
interaction, 284, 294 
meaning of, 239-242 
remainder, 273 
as error, 278-279 
residual, 253-254, 273, 293 
within-cells, 284 
within-groups, 238, 239 
Arbitrary origin, 16 
Area sampling, 336 
Arkin, H., 11 
Array, 91, 99 
Attenuation, 134-136 
Attributes, 61-62 





Average, 1, 15 
Average deviation, 20-21 

Best-fit line, 103-106 
Beta (β) coefficients, 147
Binomial distribution, 40-45 
kurtosis of, 42 
mean of, 42 

and normal curve, 43-44 
and probability, 41-43 
skewness of, 42 
standard deviation of, 42 
Biserial correlation, 167-174 
assumptions, 170, 173 
formulas, 171 
interpretation of, 173 
and point biserial, 173 
sampling error of, 172 
Blakeman criterion, 255 
Brinton, W. C., 11 
Brown-Spearman formula, 132 

Central value (tendency), 12 
mean, 15-18 
median, 14-15 
mode, 13 

Changes, evaluation of: 

for categorical data, 77-82, 204-206
by covariance method, 342-343 
for graduated series, 71-75, 342- 
343 

Chesire, L., 176 n.

Chi square (χ²), 186-215
additive property of, 202, 206 
applications as test: 
of agreement with a priori frequencies, 198
of changes, 204-207 
of correlation, 201, 210-211 
of goodness of fit, 199, 211-215 
of group differences, 199, 201- 
202, 207-211 

of independence, 199, 201 
assumptions, 197-198 
combining of, 202, 206 
correction to, for continuity, 207 
and critical ratio, 190, 196, 202-204, 206


Chi square (χ²) (Continued)
degrees of freedom, 192-193, 213- 
215 

and discontinuity, 189, 198, 207 
distribution of, 187-196 
curves, 195 
empirical, 187-190 
mathematical, 194 
and F, 343-344 

levels of significance, 196-197, 213 
and normal curve, 190, 197, 198 
and null hypothesis, 196-197 
and proportions, 202-204 
table of, 350-351 
Classification, 5-6 
See also under Analysis of variance 
Colton, R. R., 11 
Combined groups: 
mean for, 18 

standard deviation for, 25 
Common elements and correlation, 
117-118 

Comparison of groups, 63 
See also Significance, of differences 
Confidence interval, 59 
Confidence level, 54-59 
Confidence limits, 55-59, 65, 82-83, 
123-124, 221 

Contingency coefficient, 179-183 
and chi square, 182-183, 201, 210- 
211 

corrections to, 182 
sampling error of, 182-183 
upper limits of, 179 
Contingency table, 180, 199, 200, 210 
Continuity, correction for, 207 
Continuous series, 5 
Control variables, 72, 85, 336, 337, 
339 

Correction: 

for attenuation, 135-136 
to contingency coefficient, 182 
for continuity, 207 
for grouping, 24 
for uncontrolled variable, 319 
Correlation and causation, 117, 162 
Correlation between: 
categorized variables, 174-183 





Correlation between: (Continued) 
correlations, 124-125 
dichotomized and graduated variables, 167-174

dichotomized variables, 174-179 
graduated variables, 92, 183-185 
indexes, 136-138 
means, 64, 69 

point variables, 173-174, 179 
standard deviations, 69 
Correlation, factors affecting, 121- 
143 

errors of measurement, 134-136 
heterogeneity, 125-127 
heterogeneity, third variable, 139- 
141 

indexes, 138 
part-whole, 139 
range of talent, 125-127 
sampling errors, 122-124, 226-227,
251-255 
selection, 121 
Correlation, measures of: 
biserial, 167-174 
contingency, 179-183 
correlation ratio (eta), 183-185, 249-251, 259-261
fourfold point, 179 
multiple, 144-165 
See also Multiple correlation 
partial, 140-142, 227 
point biserial, 173-174 
product moment, 90-120 
See also Product moment correlation

rank-difference, 97-98 
tetrachoric, 174-179 
Correlation ratio (eta), 183-185 
computation, 259-261 
sampling significance of, 249-251, 
259-261 

Covariance, 319 

See also under Analysis of variance 
Crespi, L., 210 
Critical ratio (CR), 66
and chi square, 190, 196, 202-204, 
206 

and F, 247, 343-344 


Critical ratio (CR) (Continued) 
and t, 223-224, 343-344
Cumulative frequency distribution, 
8-9 

Curvilinearity, test of, 255-258, 259- 
262 

Decile, 19 

Degrees of freedom: 
in analysis of variance, 239, 252, 
256, 263, 272, 284, 294, 320, 
325 

for chi square, 192-193, 213-215
for F, 229 

for t test for means, 221, 224 
for t test for r, 227 
for variance estimate, 218-220 
Dependent variable, 144, 153 
Descriptive statistics, 1, 12 
Deviation score, 20, 104, 236 
Differences, see Significance, of differences

Discontinuity, 189, 198, 207 
Discrete series, 5 
Distribution: 
binomial, 40-44 
chi square, 187-196 
cumulative, 8-9 
F, 229-230 
frequency, 6 
normal, 32-34 
sampling, 51-52 
t, 217-218 

Doolittle method, 156-160 

Elderton’s table of chi square, 197, 
210 

Error: 

absolute, 128 
constant, 127 

in drawing conclusions, 66-69, 232- 
234 

of estimate, 108-112, 149-151 
of measurement, 127-131, 276-280 
relative, 128 
sampling, 60 

See also under Standard error 
sources of, 46, 131, 278-279 





Error: (Continued)
standard, see Standard error 
type I and type II, 69
variable, 127, 131 

Estimate, error of, 108-112, 149-151
Eta (η), 183-185
computation, 259-261 
sampling significance of, 249-251, 
259-261 

Experimental and control data, treatment of:

matched distributions, 339-340 
own control, 72, 80, 82, 225-226, 
274-276, 295-299 

paired (or matched) cases, 72, 77- 
80, 82, 225-226, 274-276, 295- 
299 

randomly drawn, 65, 76, 223-225, 
243-245 

sibs and littermates, 72, 225-226, 
274-276, 295-299 
Ezekiel, M., 163 

F, or variance ratio, 229
and chi square, 343-344 
and critical ratio (CR), 343-344
degrees of freedom, 229 
distribution, 229 
error term for, 279, 288-289 
for group variances, 230-231 
of independent estimates, 235, 239-240, 324

and t, 246-247, 254-255, 275, 343- 
344 

table of, 353-355 
Fiducial limits, 59 
Finite universe, 87 
Fisher, R. A., 28, 67, 123, 220, 228, 
343, 349-355 
Fitting of line, 103-106 
Form vs. form reliability, 132-133 
Fourfold point correlation, 179 
Fourfold table, 79 
and changes, 79, 204-206 
chi square for, 183, 200 
and contingency, 179, 183 
and point correlation, 179, 183 
and tetrachoric r, 175-176 


Frequency, 6 

comparison, see Chi square 

cumulative, 8 

curve, 8 

distribution, 6 

polygon, 7-8 

table, 6 

Goodness of fit: 

of normal curve, 199-200, 211-215 
of regression line, 111
Graduated series, 5 
Graphic presentation, 7-11 
histogram, 7 
line graph, 10-11 
ogive, 9 
polygon, 8 
Grouping, 5-6 
correction for, 24 
Guessed average, 15-16 

Hansen, M. W., 336 n. 

Hauser, P. M., 336 n. 

Heterogeneity and correlation, 125- 
127, 134, 139-141, 322 
Histogram, 7 
Homoscedasticity, 108 
Hovland, C., 80 

Independence, test of, 199, 201 
Independent variable, 153 
Indexes: 

correlation of, 136-138
mean of, 137 

standard deviation of, 137 
Inference, statistical, 2 
Interaction, 269, 283-284, 288-289, 
293-294, 309-311, 316-317 
Intervals, grouping, 5-6 
limits of, 6-7 
midpoints of, 7, 13 
size of, 6 

Kelley, T. L., 156, 182 
Kendall, M. G., 59 n. 

Kurtosis, 26-30 

Level of confidence, 54-59 





Level of significance, 66-69, 196-197, 
213, 230-231, 233-234 
Line graph, 10-11 
Linearity of regression, 103 
test for, 255-258, 259-262 

McCall, W. A., 38 
McNemar, Q., 331 n. 

Matched groups by means of:
matched distributions, 339-340 
paired cases, 72, 85, 337-338 
randomization, 338 
siblings and twins, 72, 341 
Mean, 15 

for combined groups, 18 
computation, 16 
sampling error of, 51-52, 217 
Mean difference, significance of, 73- 
75, 225-226 

Measurement error, 127-131, 276-280
Median, 14-15 
Midpoint of interval, 7, 13 
Mode, 13 
Moments, 26-27 
Moving averages, 8 
Multiple correlation, 150-151 
and determinants, 155-156 
and diminishing returns, 163 
Doolittle method, 156-160 
error of estimate, 149-151 
interpretation of, 151 
limitations, 161-163 
notation, 164-165 
numerical solution, 156-160 
regression equations, 147-149, 153 
relative weights in, 151-152 
sampling error of, 160-161, 262-266 
selection fallacy, 162 
and shrinkage, 161 
and suppressant variable, 163-164 

Nonlinearity, test of, 255-258, 259- 
262 

Normal correlation, 118-119 
Normal distribution curve, 32 
area under, 34 
equations for, 32, 34 
and probability, 43-45 



Normal distribution curve (Continued)

table of, 346-347 
unit form of, 34 

Notation, 164-165, 220-221, 236, 
270, 290 

Null hypothesis, 65, 196-197 
Ogive, 9 

Paired cases, 72, 85, 337-338 
Partial correlation, 140-141 
sampling error of, 142, 227 
Part-whole correlation, 139 
Paterson, D. G., 157 n. 

Pearson, K., 31, 175, 197, 220 
Percentage, sampling error of, 62 
Percentile, 19 

Point biserial correlation, 173-174 
Prediction, error of, 108-112, 149-151
Probability, 39 
addition theorem, 39 
and binomial, 40-42 
multiplication theorem, 39 
and normal curve, 43-45 
Probable error, 89 
Product moment correlation, 92 
assumptions, 103, 108, 113-114, 
116-117, 117, 118, 120 
computation, 94-96 
direction of, 111-112 
interpretations in terms of:
common elements, 117-118 
error of estimate, 108-113 
normal surface, 118-119 
rate of change, 107 
variance explained, 114-117 
limits for, 119, 142 
and prediction, 103-104, 107 
and regression, 107 
sampling error of, 122-124, 226- 
227, 251-255 

scatter diagram, 91-93, 99-102
Proportion, sampling error of, 63 
Purposive sampling, 333 

Quartile, 18-19 
Quartile deviation, 18-19 





Random sampling, 51, 332 

Randomization, 338 

Range, 6, 18 

Rank-difference correlation, 97-98 

Regression, 107 
coefficients, 107, 151 
equations, 106-107, 147, 149, 153 
test of linearity of, 255-258, 259- 
262 

Reliability, 127-134, 276-280
and attenuation, 134-136 
coefficient of, 128 
error of measurement, 130 
form vs. form, 132-133 
range, effect of, 134 
significance of, 276-280 
split-half, 132 
test-retest, 132 

Renshaw, M. J., 285 n. 

Residuals, 115, 150, 252, 273, 293, 323

Saffir, M., 176 n. 

Sampling, 46 
distribution, 51 
of chi square, 187-196 
of differences, 64 
empirical demonstration of, 46- 
50, 187-190 
of F, 229 
of t, 217-218 

errors, reduction of, 84-86, 334-336
for experimental and control 
groups, 84-85, 336-341 
from finite universe, 87-88 
representativeness of sample, 86-87 
size of sample required, 85-86 
from skewed universe, 88 
small samples, 216 
successive, 51 
techniques, 331-336 
accidental, 332 
area, 336 
purposive, 333 
random, 51, 332 
stratified, 333-336 
theory, 51-52 
variance, 52 

Scatter diagram, 91-93, 99-102 


Series, 5 

Sheppard's correction, 24
Shrinkage of multiple r, 161 
Significance, 65 

of correlation, 122-124, 226-227, 
251-255 

of correlation ratio, 249-251, 259- 
261 

of differences, 65-68 
for correlations, 124-125 
for means, correlated, 64, 71-75, 
225-226, 274-276, 295-299, 
339 

for means, independent, 65, 223- 
225, 240-242, 243-245, 342- 
343 

for means, sub- vs. total group, 
88 

for proportions, correlated, 77-82 
for proportions, independent, 75- 
77 

for standard deviations, 70, 228- 
231 

for variances, 228-231 
and erroneous conclusions, 67-69, 
232-234 

of interaction, 287-289 
levels, 66-69, 196-197, 213, 230- 
231, 233-234 

of multiple r, 160-161, 262-266 
of nonlinearity, 255-258, 259-262 
of reliability, 276-280 
and simple hypotheses, 83 
Skewness, 26-27 
of binomial distribution, 42 
causes of, 29-30 
of sampling distributions: 
of correlations, 123 
of proportions (or percentages), 
62 

of standard deviations, 217, 228 
Small sample treatment, 216 
of correlation, 226-227 
of difference: 

for correlated means, 225-226 
for independent means, 223-225 
for variances, 228-231 
of single mean, 221 





Smoothing, 8 
Snedecor, G. W., 229 
Spearman-Brown formula, 132 
Split-half reliability, 132 
Spurious correlation, 138, 139 
Standard deviation, 21 
for combined groups, 25 
computation, 21-25 
sampling error of, 60 
Sheppard’s correction, 24 
Standard error, 52, 59-60 
of average deviation, 60 
of correlation measures: 
biserial, 172 
multiple, 160 
product moment, 122 
tetrachoric, 177-178 
z (transformed r), 123, 142 
of kurtosis, 61 
of mean, 52, 60 
from finite universe, 87 
for stratified sample, 334-335 
of mean difference, 74-75 
of median, 60 
of percentage, 62 
of proportion, 63 
from finite universe, 87 
for stratified sample, 334 
of quartile deviation, 60 
of skewness, 61 
of standard deviation, 60 
of z (transformed r), 123, 142 
Standard error of difference, 65, 69 
for correlations, 124 
by way of z’s, 124, 125
for means, correlated, 64, 74, 339, 
340 

for means, independent, 65 
for means, sub- vs. total group, 88 
for medians, 70 
for proportions, correlated, 80 
for proportions, independent, 76 
for standard deviations, 70 
for z’s (transformed r’s), 124, 125
Standard error of estimate, 108-112, 
149-151 

Standard error of measurement, 127- 
131, 276-280 


Standard score, 33, 36-38 
and T score, 38 
Stratified sampling, 333-336 
“Student,” 341
Successive sampling, 51 
Sum of squares, 222 
See also under Analysis of variance 
Suppressant variable, 163-164 

t ratio, 216 

assumptions and limitations in use 
of, 216, 225, 231-234 
and confidence limits, 221 
for correlation, 226-227 
and critical ratio (CR), 223-224
degrees of freedom, 218-219, 227 
for difference: 

in correlated means, 225-226 
in independent means, 223-225 
distribution of, 217-218 
and F, 246-247, 254-255, 275, 343- 
344 

for single mean, 217, 221 
table of, 352 
T score, 38 
Tabulation, 5, 91 
Taubman, R. E., 281 n. 

Test-retest reliability, 132 
Tetrachoric correlation, 174-179 
computing diagrams for, 176-177 
formula, 176 

sampling error of, 177-178 
Thurstone, L. L., 176 n. 

True score, 128 

Variance, 21 

additive nature of, 114-115 
computation, 21-25, 222 
and correlation, 114-117, 151 
of difference, 114-115 
estimates of, 219, 224-225 
ratio, see F, or variance ratio 
sampling, 52 
of sum, 114-115 
theorem, 114-115 
See also Analysis of variance 
Variation, 12, 18 
average deviation, 20-21 





Variation (Continued)
coefficient of, 136 
quartile deviation, 18 
standard deviation, 21-26 

Walker, E. L., 309, 311 

Wilks, S. S., 339 

Wright, Suzanne T., 243, 267 


Yates’s correction for continuity, 207 

z, for difference between standard deviations, 228-229
z score, 33, 36-38 
z transformation for r, 123-124 
tables of, 348, 349 




